Convert CSV to Parquet using AWS Glue

Today we will learn how to convert CSV files to Parquet using an AWS Glue ETL job.
Steps:
  • Create a new Glue ETL Spark job
  • Select the data source
    • The table should already be listed in the Glue Data Catalog
    • If you haven't done so yet, create a Glue crawler first to store the CSV table's metadata in the Glue Data Catalog (a minimal boto3 sketch of this step appears after this list).
      • The crawler path should be the folder stored in S3, not the file. So if you have the file structure CSVFolder>CSVfile.csv, you have to select CSVFolder as the path, not the file CSVfile.csv.
  • Choose a transform type = Change Schema
  • Choose a data target
    • Format = Parquet
    • Select "Create tables in your data target"
      • Create a separate folder in S3 to hold the Parquet output. In Glue, you have to specify one folder per dataset (one folder for CSV and one for Parquet).
      • The path should be the folder, not the file. So if you have the file structure ParquetFolder>Parquetfile.parquet, you have to select ParquetFolder as the path.
    • Give the target path of the S3 folder where you want the Parquet files to be stored
  • Map the source columns to the target (they will be auto-mapped for you); just review the column mappings.
    • Click "Save job and edit script"
  • A visual mapping diagram will be created for you, along with the PySpark code on the right side (a trimmed sketch of such a script appears after this list).
  • Run the job
  • Now go to the S3 folder and check whether the Parquet files have been created (see the listing sketch after this list).
    • You can also create a Glue crawler to store the Parquet files' metadata in the Glue Data Catalog and then query the data in Athena
    • Alternatively, you can preview the data directly in the S3 console
  • In some cases you may get the error "Fatalexception: unable to parse file filename.csv"
    • It means that your CSV is not encoded as UTF-8.
    • To resolve the error, convert your CSV to UTF-8 (a small re-encoding sketch appears after this list).
    • You can follow the steps to convert a CSV to UTF-8 from here
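
If you haven't created the crawler yet, here is a minimal boto3 sketch of that step. The region, bucket, database name, and IAM role ARN below are placeholders for illustration, not values from this walkthrough:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # assumed region

# Hypothetical names -- replace with your own bucket, database, and role.
glue.create_crawler(
    Name="csv-metadata-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder ARN
    DatabaseName="csv_parquet_demo",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/CSVFolder/"}]},  # folder, not file
)

# Run the crawler so the CSV table shows up in the Glue Data Catalog.
glue.start_crawler(Name="csv-metadata-crawler")
```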
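
The script Glue generates for this job looks roughly like the sketch below. This is a hand-trimmed approximation, not the exact generated code; the database, table, column, and S3 path names are assumptions:

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the CSV table the crawler registered in the Glue Data Catalog.
# "csv_parquet_demo" and "csvfolder" are placeholder names.
source = glueContext.create_dynamic_frame.from_catalog(
    database="csv_parquet_demo", table_name="csvfolder"
)

# The auto-generated column mappings; these columns are placeholders.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("id", "string", "id", "string"), ("name", "string", "name", "string")],
)

# Write the records out as Parquet into the target folder.
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/ParquetFolder/"},
    format="parquet",
)

job.commit()
```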
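
To check the output without clicking through the console, a quick boto3 listing works; the bucket and prefix are the same placeholders as above:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/prefix -- use the target path you gave the job.
resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="ParquetFolder/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```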
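
If you hit the parsing error above, a small re-encoding pass in Python is usually enough. This sketch assumes the source file is Windows-1252 encoded; adjust the source encoding if yours differs:

```python
# Re-encode a CSV to UTF-8. "cp1252" is an assumption about the source
# encoding; common alternatives are "latin-1" or "utf-16".
with open("filename.csv", "r", encoding="cp1252") as src:
    data = src.read()

with open("filename-utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(data)
```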
