
AWS GLUE VS AWS DATA PIPELINE - Which one to choose?

Today we will try to understand the difference between AWS Glue and AWS Data Pipeline.
Are you considering designing your ETL pipeline in the AWS cloud? The table below will help you understand which AWS ETL service to choose according to your needs:

AWS GLUE

AWS Data Pipeline

Definition

AWS Glue: A serverless ETL service.
AWS Data Pipeline: A web service that helps you create complex data pipelines. It is not serverless: to execute the tasks in a pipeline, it spins up an EC2 instance to run the job and terminates the instance after the job is completed.
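To make the EC2 lifecycle described above concrete, here is a minimal sketch of a pipeline definition built in Python. It follows the pipeline definition file layout (an `objects` list containing an `Ec2Resource` and a `ShellCommandActivity`). The object ids `WorkerInstance` and `RunJob`, the instance type, and the helper name `build_pipeline_definition` are assumptions made for illustration, and the role names are the service's usual defaults, which may differ in your account.

```python
import json


def build_pipeline_definition(command: str,
                              instance_type: str = "t2.micro",
                              terminate_after: str = "1 Hour") -> dict:
    """Return a pipeline definition that runs one shell command on a
    short-lived EC2 instance (hypothetical ids, for illustration only)."""
    return {
        "objects": [
            {
                # Pipeline-wide defaults; "ondemand" means it runs when activated.
                "id": "Default",
                "name": "Default",
                "scheduleType": "ondemand",
                "role": "DataPipelineDefaultRole",
                "resourceRole": "DataPipelineDefaultResourceRole",
            },
            {
                # The EC2 instance that Data Pipeline launches for the job.
                "id": "WorkerInstance",
                "name": "WorkerInstance",
                "type": "Ec2Resource",
                "instanceType": instance_type,
                # Upper bound on the instance's lifetime.
                "terminateAfter": terminate_after,
            },
            {
                # The task itself, bound to the instance above via runsOn.
                "id": "RunJob",
                "name": "RunJob",
                "type": "ShellCommandActivity",
                "command": command,
                "runsOn": {"ref": "WorkerInstance"},
            },
        ]
    }


if __name__ == "__main__":
    print(json.dumps(build_pipeline_definition("echo hello"), indent=2))
```

The point of the sketch is that the compute resource is an explicit object you declare and size yourself, which is the main operational difference from Glue's serverless model.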

Resiliency

AWS Glue: Fault tolerant, scalable, highly available, and distributed.
AWS Data Pipeline: Fault tolerant, highly available, scalable, and distributed.

ETL Design

AWS Glue: GUI-based as well as developer friendly. Developers can write ETL transformation code in PySpark.
AWS Data Pipeline: GUI-based, with predefined ETL templates and drag-and-drop functionality that make building complex pipelines quick and easy.

Pricing

AWS Glue: Cost effective. You pay only for execution time (around $0.44 per DPU per hour).
AWS Data Pipeline: The low-frequency model costs around $0.66 per month and the high-frequency model around $1 per month, charged per activity.
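As a rough worked example of the two pricing models, the sketch below plugs the approximate rates quoted above into simple formulas. The rates are the figures from this post rather than live AWS prices, the function names are made up for illustration, and the Data Pipeline figure leaves out the separate cost of the EC2 instances the pipeline launches.

```python
# Approximate rates quoted in this post (assumptions, not live AWS prices).
GLUE_RATE_PER_DPU_HOUR = 0.44
PIPELINE_LOW_FREQ_PER_ACTIVITY = 0.66   # per activity, per month
PIPELINE_HIGH_FREQ_PER_ACTIVITY = 1.00  # per activity, per month


def glue_monthly_cost(dpus: int, hours_per_run: float, runs_per_month: int) -> float:
    """Glue bills for execution time: DPUs x hours x rate, summed over runs."""
    return round(dpus * hours_per_run * runs_per_month * GLUE_RATE_PER_DPU_HOUR, 2)


def pipeline_monthly_cost(activities: int, high_frequency: bool = False) -> float:
    """Data Pipeline bills a flat monthly fee per activity (compute excluded)."""
    rate = (PIPELINE_HIGH_FREQ_PER_ACTIVITY if high_frequency
            else PIPELINE_LOW_FREQ_PER_ACTIVITY)
    return round(activities * rate, 2)


if __name__ == "__main__":
    # A 10-DPU Glue job running 30 minutes a day for 30 days.
    print(glue_monthly_cost(dpus=10, hours_per_run=0.5, runs_per_month=30))  # 66.0
    # A pipeline with three low-frequency activities.
    print(pipeline_monthly_cost(activities=3))  # 1.98
```

The comparison is only meaningful once the EC2 cost of the pipeline's instances is added to the Data Pipeline side, since Glue's DPU rate already includes compute.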

Data Sources

AWS Glue: Supports many more data sources, because developers can import Python libraries to define data sources that are not predefined.
AWS Data Pipeline: You have to work with the predefined data sources available within Data Pipeline.

Scheduling

AWS Glue: Supports event-driven ETL pipeline triggers, with three types of triggers (scheduled, conditional, and on-demand).
AWS Data Pipeline: Runs pipelines on a defined schedule or on demand.
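The three trigger types named above (scheduled, conditional, and on-demand) are the ones AWS Glue defines for starting jobs. As a minimal sketch, the helper below builds the request for boto3's Glue `create_trigger` call for each type. The helper names and the job and trigger names in the usage lines are hypothetical, and the actual API call is kept in a separate function because it needs boto3 and AWS credentials.

```python
from typing import Any, Dict, Optional


def build_trigger_request(name: str,
                          job_name: str,
                          trigger_type: str,
                          schedule: Optional[str] = None,
                          watched_job: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for glue.create_trigger (illustrative helper)."""
    request: Dict[str, Any] = {
        "Name": name,
        "Type": trigger_type,
        "Actions": [{"JobName": job_name}],  # job to start when the trigger fires
    }
    if trigger_type == "SCHEDULED":
        if schedule is None:
            raise ValueError("a scheduled trigger needs a cron expression")
        request["Schedule"] = schedule  # e.g. "cron(0 2 * * ? *)"
        request["StartOnCreation"] = True
    elif trigger_type == "CONDITIONAL":
        if watched_job is None:
            raise ValueError("a conditional trigger needs a job to watch")
        # Fire once the watched job finishes successfully.
        request["Predicate"] = {
            "Conditions": [{
                "LogicalOperator": "EQUALS",
                "JobName": watched_job,
                "State": "SUCCEEDED",
            }]
        }
        request["StartOnCreation"] = True
    elif trigger_type != "ON_DEMAND":
        raise ValueError(f"unsupported trigger type: {trigger_type}")
    return request


def create_trigger(request: Dict[str, Any]) -> Dict[str, Any]:
    """Send the request to AWS; requires boto3 and valid credentials."""
    import boto3  # imported here so the builder works without boto3 installed
    return boto3.client("glue").create_trigger(**request)


if __name__ == "__main__":
    print(build_trigger_request("nightly", "load-orders", "SCHEDULED",
                                schedule="cron(0 2 * * ? *)"))
    print(build_trigger_request("after-load", "build-report", "CONDITIONAL",
                                watched_job="load-orders"))
    print(build_trigger_request("manual", "load-orders", "ON_DEMAND"))
```

Chaining a scheduled trigger to a conditional one, as in the usage lines, is one way to express a two-step workflow without an external scheduler.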

Streaming

AWS Glue: Serverless streaming ETL for building continuous ingestion pipelines that prepare streaming data. It can consume data from streaming sources such as Kinesis and Kafka, clean and transform it on the fly, and make it available for analysis in seconds.

Any comments or thoughts are much appreciated!
