
AWS EMR: Read CSV file from S3 bucket using Spark dataframe

Today we will learn how to use Spark within AWS EMR to read a CSV file from an S3 bucket

Steps:
  • Create an S3 bucket and place a CSV file inside the bucket
    
  • SSH into the EMR master node
    • Get the master node's Public DNS from the EMR cluster settings
    • On Windows, open PuTTY and SSH into the master node using your key pair (PEM file)
  • Type "pyspark"
    • This launches the Spark shell with Python as the default language
  • Create a Spark dataframe to access the CSV in the S3 bucket
    • Command: df = spark.read.csv("<S3 path to csv>", header=True, sep=',')
  • Type "df.show()" to view the contents of the dataframe in tabular format
  • You are done



AWS EMR: Create a hive table with csv file stored in a S3 bucket

Today we will learn how to create a Hive table (inside an EMR cluster) from a CSV file stored in an S3 bucket

Steps:
  • Go to your EMR cluster and copy the "Master Public DNS"
    • This is the public DNS name of your master node
  • If you are using a Windows machine, download and install PuTTY to SSH into the master node
  • Open PuTTY and log in with your AWS key pair (PEM file)
  • At the "login as:" prompt, type hadoop
  • You are now logged in to the master node
  • Type vi <scriptname>
    • This opens the vi editor
  • Press "i" to enter insert mode in the vi editor
  • Copy and paste your script
  • Press Esc
    • Type :wq
    • Hit Enter
    • This writes the script to the file and exits the vi editor
  • Run the script using the "hive -f <scriptname>" command
  • You are done
    • A Hive database and a table have been created
  • To verify the results, go to Hue
    • The first time, it will ask you to enter a username and password
    • These become your Hue login credentials
  • Write a SELECT statement to verify the data in the table you created above
  • You will see the data inside your table
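A minimal HiveQL script of the kind you would paste into vi might look like this sketch. The database, table, column, and bucket names are all hypothetical placeholders, and the CSV is assumed to have two string columns:

```sql
-- Create a database and an external table over the CSV file in S3.
-- All names and the S3 location below are placeholders -- replace with your own.
CREATE DATABASE IF NOT EXISTS mydb;

CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable (
    name STRING,
    city STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/my-folder/'
-- If your CSV has a header row, skip it when reading:
TBLPROPERTIES ('skip.header.line.count'='1');
```

Save it as, say, create_table.hql and run it with "hive -f create_table.hql". In Hue, a statement like `SELECT * FROM mydb.mytable LIMIT 10;` then verifies the data. Note that `LOCATION` points to the S3 folder containing the file, not the file itself.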

How to create AWS EMR cluster with Hadoop, Hive and Spark on it

Today we will learn how to create an AWS EMR Hadoop cluster with Spark on it
Steps:
  • Go to EMR and create a cluster
  • Select Core Hadoop option under Applications

  • Click "Go to advanced options"
  • Select Spark under software configuration
    • Additionally, select Zeppelin
      • Zeppelin lets you use notebook to write spark queries/scripts
  • Click Next
  • Under General cluster settings, check "EMRFS consistent view"
    • Consistent view provides consistency checking for list and read-after-write (for new put requests) for objects in Amazon S3
  • Create an EC2 key pair
    • To create an Amazon EC2 key pair:
      • On the Key Pairs page, click Create Key Pair
      • In the Create Key Pair dialog box, enter a name for your key pair, such as mykeypair
      • Click Create
      • Save the resulting PEM file in a safe location
  • Specify Key Pair in the cluster settings
  • Click Next and create the cluster
  • Wait for the cluster to start
  • Once the cluster is up and running, you can SSH into it
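The console steps above can also be sketched with the AWS CLI. The cluster name, key pair name, release label, instance type, and instance count below are all example values, not prescriptions:

```shell
# Create an EMR cluster with Hadoop, Hive, Spark, and Zeppelin installed.
# All names and sizes are placeholder choices -- adjust to your needs.
aws emr create-cluster \
    --name "my-spark-cluster" \
    --release-label emr-5.36.0 \
    --applications Name=Hadoop Name=Hive Name=Spark Name=Zeppelin \
    --ec2-attributes KeyName=mykeypair \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles
```

The command prints the new cluster's ID; you can then poll its state with `aws emr describe-cluster --cluster-id <id>` until it reaches WAITING.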