
AWS EMR: Read CSV file from S3 bucket using Spark dataframe

Today we will learn how to use Spark within AWS EMR to read a CSV file from an S3 bucket

Steps:
  • Create an S3 bucket and place a CSV file inside the bucket
    
  • SSH into the EMR master node
    • Get the master node's Public DNS from the EMR cluster settings
    • On Windows, open PuTTY and SSH into the master node using your key pair (PEM file)
  • Type "pyspark"
    • This launches the Spark shell with Python as the default language
  • Create a Spark dataframe to access the CSV in the S3 bucket
    • Command: df = spark.read.csv("<S3 path to csv>", header=True, sep=',')
  • Type "df.show()" to view the contents of the dataframe in tabular format
  • You are done



AWS EMR: Create a hive table with csv file stored in a S3 bucket

Today we will learn how to create a Hive table (inside an EMR cluster) from a CSV file stored in an S3 bucket

Steps:
  • Go to your EMR cluster and copy the "Master Public DNS"
    • This is the public DNS name of your master node
  • If you are using a Windows machine, download and install PuTTY to SSH into the master node
  • Open PuTTY and log in with your AWS key pair (PEM file)
  • At the "login as:" prompt, type hadoop
  • You are now logged in to the master node
  • Type vi <scriptname>
    • This opens the vi editor
  • Press "i" to enter insert mode in the vi editor
  • Copy and paste your script
  • Press Esc
    • Type :wq
    • Hit Enter
    • This writes the script to the file and exits the vi editor
  • Run the script using the "hive -f <scriptname>" command
  • You are done
    • A Hive database and a table have been created
  • To verify the results, go to Hue
    • The first time, it will ask you to enter a username and password
    • These become your Hue login credentials
  • Write a SELECT statement to verify the data in the table you created above
  • You will see the data inside your table
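A minimal HiveQL script of the kind you would paste into vi might look like this sketch. The database, table, column, and bucket names are all hypothetical placeholders, and the CSV is assumed to have two string columns:

```sql
-- Create a database and an external table over the CSV file in S3.
-- All names and the S3 location below are placeholders -- replace with your own.
CREATE DATABASE IF NOT EXISTS mydb;

CREATE EXTERNAL TABLE IF NOT EXISTS mydb.mytable (
    name STRING,
    city STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/my-folder/'
-- If your CSV has a header row, skip it when reading:
TBLPROPERTIES ('skip.header.line.count'='1');
```

Save it as, say, create_table.hql and run it with "hive -f create_table.hql". In Hue, a statement like `SELECT * FROM mydb.mytable LIMIT 10;` then verifies the data. Note that `LOCATION` points to the S3 folder containing the file, not the file itself.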

How to create AWS EMR cluster with Hadoop, Hive and Spark on it

Today we will learn how to create an AWS EMR Hadoop cluster with Spark on it
Steps:
  • Go to EMR and create a cluster
  • Select Core Hadoop option under Applications

  • Click "Go to advanced options"
  • Select Spark under software configuration
    • Additionally, select Zeppelin
      • Zeppelin lets you use notebook to write spark queries/scripts
  • Click Next
  • Under General cluster settings, check "EMRFS consistent view"
    • Consistent view provides consistency checking for list and read-after-write (for new put requests) for objects in Amazon S3
  • Create an EC2 key pair
    • To create an Amazon EC2 key pair:
      • On the Key Pairs page, click Create Key Pair
      • In the Create Key Pair dialog box, enter a name for your key pair, such as mykeypair
      • Click Create
      • Save the resulting PEM file in a safe location
  • Specify Key Pair in the cluster settings
  • Click Next and create the cluster
  • Wait for the cluster to start
  • Once the cluster is up and running, you can SSH into it
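The console steps above can also be sketched with the AWS CLI. The cluster name, key pair name, release label, instance type, and instance count below are all example values, not prescriptions:

```shell
# Create an EMR cluster with Hadoop, Hive, Spark, and Zeppelin installed.
# All names and sizes are placeholder choices -- adjust to your needs.
aws emr create-cluster \
    --name "my-spark-cluster" \
    --release-label emr-5.36.0 \
    --applications Name=Hadoop Name=Hive Name=Spark Name=Zeppelin \
    --ec2-attributes KeyName=mykeypair \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles
```

The command prints the new cluster's ID; you can then poll its state with `aws emr describe-cluster --cluster-id <id>` until it reaches WAITING.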