AWS EMR: Read CSV file from S3 bucket using Spark dataframe

Today we will learn how to use Spark on AWS EMR to read a CSV file from an S3 bucket.

Steps:
  • Create an S3 bucket and place a CSV file inside it
    
  • SSH into the EMR Master node
    • Get the Master Node Public DNS from EMR Cluster settings
    • On Windows, open PuTTY and SSH into the Master node using your key pair (pem file)
  • Type "pyspark"
    • This launches the interactive Spark shell with Python as the default language
  • Create a Spark dataframe that reads the CSV from the S3 bucket
    • Command: df = spark.read.csv("<S3 path to csv>", header=True, sep=',')
  • Type "df.show()" to view the contents of the dataframe in tabular format (see the full example after this list)
  • You are done
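
For reference, here is a minimal sketch of the full pyspark session described above. The bucket name and file name (my-bucket/data.csv) are placeholders, so substitute your own S3 path.

    # Inside the pyspark shell on the EMR master node, a SparkSession is
    # already available as the variable `spark`.
    # The S3 path below is a placeholder; replace it with your own bucket and key.
    df = spark.read.csv("s3://my-bucket/data.csv", header=True, sep=",")

    # Print the inferred schema, then view the first rows in tabular format
    df.printSchema()
    df.show()

Because EMR ships with the EMR File System (EMRFS), the s3:// path works out of the box as long as the cluster's IAM role has read access to the bucket.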


