Today we will learn how to perform an upsert in Azure Data Factory (ADF) using the pipeline approach instead of data flows.
Task:
- We will load data from a csv file (stored in ADLS V2) into Azure SQL with an upsert using Azure Data Factory. Once the data load finishes, we will move the file to an Archive directory and append a timestamp to the file name that denotes when the file was loaded into the database.
Benefits of using a pipeline:
- As you know, triggering a data flow adds cluster start-up time (~5 minutes) to your job execution time. You can use the Time to live (TTL) setting on your Azure integration runtime (IR) to reduce this, but a cluster may still take around 2 minutes to start a Spark context. So, if you are building a real-time data load process, the pipeline approach works best because it does not need a cluster to run and can execute in seconds.
Pre-requisites:
- An Azure Data Factory instance
- An Azure Storage account (ADLS V2) to hold the source csv file
- An Azure SQL database to hold the target table and upsert stored procedure
Steps:
1. Create a .csv file with some dummy data and upload it into Azure Storage (ADLS V2)
2. In ADF, create a connection (linked service) to ADLS as the source
3. In ADF, create a connection (linked service) to Azure SQL as the target (example JSON for both connections follows this step)
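For reference, here is a minimal sketch of what the two linked services could look like in ADF JSON. The names (ADLSGen2LinkedService, AzureSqlLinkedService) and placeholder values are illustrative only; in practice you would create them through the ADF UI and ideally keep secrets in Azure Key Vault.

```json
{
  "name": "ADLSGen2LinkedService",
  "properties": {
    "type": "AzureBlobFS",
    "typeProperties": {
      "url": "https://<storage-account>.dfs.core.windows.net",
      "accountKey": { "type": "SecureString", "value": "<account-key>" }
    }
  }
}
```

```json
{
  "name": "AzureSqlLinkedService",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": "Server=tcp:<server>.database.windows.net,1433;Database=<database>;User ID=<user>;Password=<password>;Encrypt=true;"
    }
  }
}
```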
4. In ADF, create a dataset for the source csv using the ADLS V2 connection
5. In ADF, create a dataset for the target csv using the ADLS V2 connection; this will be used to move the file into the Archive directory
   1. In the dataset connection, add dynamic content that points to the Archive directory and appends the current timestamp to the file name (a sketch of this dataset follows this step)
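A minimal sketch of that Archive dataset in ADF JSON, assuming a hypothetical container named data, a folder named archive, and a file-name pattern of employees_<timestamp>.csv; the key piece is the Expression on fileName, which uses ADF's concat(), formatDateTime() and utcnow() functions:

```json
{
  "name": "ArchiveCsvDataset",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": { "referenceName": "ADLSGen2LinkedService", "type": "LinkedServiceReference" },
    "typeProperties": {
      "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": "data",
        "folderPath": "archive",
        "fileName": {
          "value": "@concat('employees_', formatDateTime(utcnow(), 'yyyyMMddHHmmss'), '.csv')",
          "type": "Expression"
        }
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}
```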
6. In ADF, create a dataset for the target SQL table using the Azure SQL connection (a minimal sketch follows this step)
   1. Enter your SQL table name here
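A minimal sketch of the target SQL dataset, assuming a hypothetical table dbo.Employee (swap in your own schema and table name):

```json
{
  "name": "TargetSqlDataset",
  "properties": {
    "type": "AzureSqlTable",
    "linkedServiceName": { "referenceName": "AzureSqlLinkedService", "type": "LinkedServiceReference" },
    "typeProperties": {
      "schema": "dbo",
      "table": "Employee"
    }
  }
}
```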
7. Log on to Azure SQL and create the below objects (example T-SQL follows this step):
   1. SQL table
   2. Upsert stored procedure
   3. Table type (a user-defined table data type that the stored procedure accepts as its parameter)
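The exact DDL depends on your csv columns, but here is an example sketch of the three objects, assuming a simple Employee table keyed on EmployeeId. The names dbo.Employee, dbo.EmployeeType and dbo.usp_UpsertEmployee are hypothetical; replace them with your own.

```sql
-- 1. Target SQL table (adjust the columns to match your csv)
CREATE TABLE dbo.Employee
(
    EmployeeId   INT           NOT NULL PRIMARY KEY,
    FirstName    NVARCHAR(100) NULL,
    LastName     NVARCHAR(100) NULL,
    ModifiedDate DATETIME2     NULL
);
GO

-- 2. User-defined table type: this is the "Table type" entered on the Copy activity sink
CREATE TYPE dbo.EmployeeType AS TABLE
(
    EmployeeId INT           NOT NULL,
    FirstName  NVARCHAR(100) NULL,
    LastName   NVARCHAR(100) NULL
);
GO

-- 3. Upsert stored procedure: ADF passes the copied rows in through the table-type parameter
--    (@Employee below is the "Table type parameter name" entered on the Copy activity sink)
CREATE PROCEDURE dbo.usp_UpsertEmployee
    @Employee dbo.EmployeeType READONLY
AS
BEGIN
    SET NOCOUNT ON;

    -- Update existing rows and insert new ones in a single MERGE
    MERGE dbo.Employee AS tgt
    USING @Employee AS src
        ON tgt.EmployeeId = src.EmployeeId
    WHEN MATCHED THEN
        UPDATE SET tgt.FirstName    = src.FirstName,
                   tgt.LastName     = src.LastName,
                   tgt.ModifiedDate = SYSUTCDATETIME()
    WHEN NOT MATCHED THEN
        INSERT (EmployeeId, FirstName, LastName, ModifiedDate)
        VALUES (src.EmployeeId, src.FirstName, src.LastName, SYSUTCDATETIME());
END
GO
```

If you prefer to avoid MERGE, an UPDATE followed by an INSERT ... WHERE NOT EXISTS inside the same procedure achieves the same upsert behaviour.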
8. Create a pipeline in Azure Data Factory (ADF) and drag the below activities into the pipeline:
   1. Copy activity (loads the csv into Azure SQL; a JSON sketch of this activity follows this step)
      1. Configure the source to the ADLS connection and point it to the csv file location
      2. Configure the sink to the Azure SQL database connection
         1. Enter the upsert stored procedure name
         2. Enter the table type
         3. Enter the table type parameter name
         4. In Sink dataset, select the target SQL dataset you created above
      3. Go to Mapping and click Import schemas
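Behind the UI, those sink settings end up in the Copy activity JSON roughly as shown below. This is a sketch reusing the hypothetical names from the SQL example (SourceCsvDataset is likewise a placeholder); the three properties to note are sqlWriterStoredProcedureName, sqlWriterTableType and storedProcedureTableTypeParameterName.

```json
{
  "name": "CopyCsvToSql",
  "type": "Copy",
  "inputs":  [ { "referenceName": "SourceCsvDataset", "type": "DatasetReference" } ],
  "outputs": [ { "referenceName": "TargetSqlDataset", "type": "DatasetReference" } ],
  "typeProperties": {
    "source": {
      "type": "DelimitedTextSource",
      "storeSettings": { "type": "AzureBlobFSReadSettings" }
    },
    "sink": {
      "type": "AzureSqlSink",
      "sqlWriterStoredProcedureName": "[dbo].[usp_UpsertEmployee]",
      "sqlWriterTableType": "EmployeeType",
      "storedProcedureTableTypeParameterName": "Employee"
    }
  }
}
```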
   2. Copy activity (moves the csv into the Archive directory)
      1. Configure the source to the source ADLS dataset created above
      2. Configure the sink to the target (Archive) ADLS dataset created above
   3. Delete activity (removes the original csv; a JSON sketch follows this step)
      1. In the Source settings, configure the dataset to the source ADLS dataset
      2. Untick the Recursively checkbox (this makes sure you are only deleting the source csv file)
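For reference, the Delete activity serializes to something like the sketch below (the dataset name is a placeholder, and the exact storeSettings shape can vary by source type); the important part is recursive set to false so that only the single source file is removed:

```json
{
  "name": "DeleteSourceCsv",
  "type": "Delete",
  "typeProperties": {
    "dataset": { "referenceName": "SourceCsvDataset", "type": "DatasetReference" },
    "storeSettings": {
      "type": "AzureBlobFSReadSettings",
      "recursive": false
    },
    "enableLogging": false
  }
}
```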
9. Your pipeline should look like the above.
10. You are done. Trigger the pipeline and check the updated / new records in SQL.
Note: You will see a drastic decrease in the pipeline execution time. :)