Convert CSV to Parquet using AWS Glue

Today we will learn how to convert CSV files to Parquet using an AWS Glue ETL job.
Steps:
  • Create a new Glue ETL Spark job
  • Select the data source
    • The table should already be listed in the Glue Data Catalog
    • If you haven't done so yet, create a Glue crawler first to store the CSV table's metadata in the Glue Data Catalog (a minimal boto3 sketch of this step appears after this list).
      • The crawler path should be the folder stored in S3, not the file. So if you have the file structure CSVFolder>CSVfile.csv, you have to select CSVFolder as the path, not the file CSVfile.csv.
  • Choose a transform type = Change Schema
  • Choose a data target
    • Format = Parquet
    • Select "Create tables in your data target"
      • Create a separate folder in S3 to hold the Parquet output. In Glue, you have to specify one folder per dataset (one folder for CSV and one for Parquet).
      • The path should be the folder, not the file. So if you have the file structure ParquetFolder>Parquetfile.parquet, you have to select ParquetFolder as the path.
    • Give the target path of the S3 folder where you want the Parquet files to be stored
  • Map the source columns to the target (they will be auto-mapped for you); just review the column mappings.
    • Click "Save job and edit script"
  • A visual mapping diagram will be created for you, along with the PySpark code on the right side (a trimmed sketch of such a script appears after this list).
  • Run the job
  • Now go to the S3 folder and check whether the Parquet files have been created (see the listing sketch after this list).
    • You can also create a Glue crawler to store the Parquet files' metadata in the Glue Data Catalog and then query the data in Athena
    • Alternatively, you can preview the data directly in the S3 console
  • In some cases you may get the error "Fatalexception: unable to parse file filename.csv"
    • It means that your CSV is not encoded as UTF-8.
    • To resolve the error, convert your CSV to UTF-8 (a small re-encoding sketch appears after this list).
    • You can follow the steps to convert a CSV to UTF-8 from here
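
If you haven't created the crawler yet, here is a minimal boto3 sketch of that step. The region, bucket, database name, and IAM role ARN below are placeholders for illustration, not values from this walkthrough:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # assumed region

# Hypothetical names -- replace with your own bucket, database, and role.
glue.create_crawler(
    Name="csv-metadata-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder ARN
    DatabaseName="csv_parquet_demo",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/CSVFolder/"}]},  # folder, not file
)

# Run the crawler so the CSV table shows up in the Glue Data Catalog.
glue.start_crawler(Name="csv-metadata-crawler")
```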
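
The script Glue generates for this job looks roughly like the sketch below. This is a hand-trimmed approximation, not the exact generated code; the database, table, column, and S3 path names are assumptions:

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the CSV table the crawler registered in the Glue Data Catalog.
# "csv_parquet_demo" and "csvfolder" are placeholder names.
source = glueContext.create_dynamic_frame.from_catalog(
    database="csv_parquet_demo", table_name="csvfolder"
)

# The auto-generated column mappings; these columns are placeholders.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("id", "string", "id", "string"), ("name", "string", "name", "string")],
)

# Write the records out as Parquet into the target folder.
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/ParquetFolder/"},
    format="parquet",
)

job.commit()
```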
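
To check the output without clicking through the console, a quick boto3 listing works; the bucket and prefix are the same placeholders as above:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/prefix -- use the target path you gave the job.
resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="ParquetFolder/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```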
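
If you hit the parsing error above, a small re-encoding pass in Python is usually enough. This sketch assumes the source file is Windows-1252 encoded; adjust the source encoding if yours differs:

```python
# Re-encode a CSV to UTF-8. "cp1252" is an assumption about the source
# encoding; common alternatives are "latin-1" or "utf-16".
with open("filename.csv", "r", encoding="cp1252") as src:
    data = src.read()

with open("filename-utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(data)
```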
