Today we will learn how to convert CSV to Parquet using an AWS Glue ETL job.
Steps:
- Create a new Glue ETL Spark Job
- Select the data source
- The source table should be listed in the Glue Data Catalog
- If you haven't already, create a Glue crawler to store the CSV metadata table in the Glue Data Catalog before starting this task (a boto3 sketch for this is included after the list below).
- The path should be the folder stored in S3, not the file. So, if your file structure is CSVFolder > CSVfile.csv, you have to select CSVFolder as the path, not the file CSVfile.csv.
- Choose a transform type = Change Schema
- Choose a data target
- Format = Parquet
- Select "Create tables in your data target"
- Create a separate folder in your S3 bucket to hold the Parquet output. In Glue, you have to specify one folder per dataset (one folder for the CSV and one for the Parquet).
- The path should be the folder, not the file. So, if your file structure is ParquetFolder > Parquetfile.parquet, you have to select ParquetFolder as the path.
- Give the target path of the S3 folder where you want the Parquet files to be stored
- Map the source columns to the target columns (they will be auto-mapped for you). Just review the column mappings.
- Click "Save job and edit script"
- A visual mapping diagram will be created for you, along with the PySpark code on the right side (a minimal sketch of such a generated script appears after this list).
- Run the job
- Now go to the S3 folder and check whether the Parquet files have been created (you can also do this programmatically; see the listing sketch after this list).
- You can also create a Glue crawler to store the Parquet files' metadata in the Glue Data Catalog and then query the data in Athena (an Athena query sketch follows below).
- Alternatively, you can also view the data using the S3 preview.
- In some cases you may get the error "FatalException: unable to parse file filename.csv"
- This means your CSV is not UTF-8 encoded.
- To resolve the error, convert your CSV to UTF-8 format.
- You can follow the steps to convert a CSV to UTF-8 from here (a minimal Python sketch is also included after this list).
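
If you prefer to create the crawler programmatically rather than through the console, a minimal boto3 sketch is shown below. The crawler name, IAM role ARN, database name, and S3 path are placeholders I've assumed to match the CSVFolder example above; substitute your own values.

```python
import boto3

glue = boto3.client("glue")

# Placeholder names, role ARN, database, and S3 path -- adjust to your setup.
glue.create_crawler(
    Name="csv-metadata-crawler",
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",
    DatabaseName="csv_to_parquet_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/CSVFolder/"}]},
)

# Run the crawler so the CSV table shows up in the Glue Data Catalog.
glue.start_crawler(Name="csv-metadata-crawler")
```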
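For reference, here is a minimal sketch of the kind of PySpark script Glue generates when you click "Save job and edit script". The database name, table name, column mappings, and target path are placeholders; your generated script will use the names from your own catalog and mapping step.

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the CSV table that the crawler registered in the Glue Data Catalog.
# Database and table names below are placeholders.
source = glueContext.create_dynamic_frame.from_catalog(
    database="csv_to_parquet_db",
    table_name="csvfolder",
)

# "Change Schema" transform: map source columns to target columns.
# The mappings are placeholders for whatever your CSV actually contains.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("id", "string", "id", "string"),
        ("name", "string", "name", "string"),
    ],
)

# Write the result as Parquet into the target folder in S3.
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/ParquetFolder/"},
    format="parquet",
)

job.commit()
```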
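To check the target folder without opening the console, a small boto3 sketch like the following can list what Glue wrote. The bucket name and prefix are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket and prefix for the Parquet target folder.
response = s3.list_objects_v2(Bucket="my-bucket", Prefix="ParquetFolder/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```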
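Once a second crawler has catalogued the Parquet folder, you can query it in Athena from the console or programmatically. A rough boto3 sketch is below; the table name, database, and results location are placeholders for whatever your crawler created.

```python
import boto3

athena = boto3.client("athena")

# Placeholder database, table, and output location; the table is the one the
# crawler created for the Parquet folder.
query = athena.start_query_execution(
    QueryString="SELECT * FROM parquetfolder LIMIT 10",
    QueryExecutionContext={"Database": "csv_to_parquet_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
print("Query execution id:", query["QueryExecutionId"])
```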
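As a quick alternative to the linked steps for fixing the encoding error, a small Python sketch like this can re-encode the file as UTF-8. The filenames and the assumed source encoding (Windows-1252) are placeholders; check what encoding your file actually uses.

```python
# Read the CSV in its current encoding (assumed here to be Windows-1252)
# and rewrite it as UTF-8. Filenames and source encoding are placeholders.
with open("filename.csv", "r", encoding="cp1252") as src:
    data = src.read()

with open("filename_utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(data)
```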