AWS EMR: Read CSV file from S3 bucket using Spark dataframe

Today we will learn how to use Spark on AWS EMR to read a CSV file from an S3 bucket.

Steps:
  • Create an S3 bucket and place a CSV file inside it
    
  • SSH into the EMR Master node
    • Get the Master Node Public DNS from EMR Cluster settings
    • On Windows, open PuTTY and SSH into the Master node using your key pair (pem file)
  • Type "pyspark"
    • This launches the interactive Spark shell with Python as the default language
  • Create a Spark dataframe that reads the CSV from the S3 bucket
    • Command: df = spark.read.csv("<S3 path to csv>", header=True, sep=',')
  • Type "df.show()" to view the contents of the dataframe in tabular format (see the full example after this list)
  • You are done
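
For reference, here is a minimal sketch of the full pyspark session described above. The bucket name and file name (my-bucket/data.csv) are placeholders, so substitute your own S3 path.

    # Inside the pyspark shell on the EMR master node, a SparkSession is
    # already available as the variable `spark`.
    # The S3 path below is a placeholder; replace it with your own bucket and key.
    df = spark.read.csv("s3://my-bucket/data.csv", header=True, sep=",")

    # Print the inferred schema, then view the first rows in tabular format
    df.printSchema()
    df.show()

Because EMR ships with the EMR File System (EMRFS), the s3:// path works out of the box as long as the cluster's IAM role has read access to the bucket.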


