
Azure Data Factory SqlErrorNumber 8672: A MERGE statement cannot UPDATE/DELETE the same row of the target table multiple times

Scenario: We have a Copy activity in a Data Factory pipeline that copies data from Azure Data Lake Storage (ADLS) to Azure SQL. In the sink, we use a SQL stored procedure to upsert the data with a MERGE statement, passing the source data to the procedure through a table type parameter.

When the pipeline runs, it fails with the error message below:
  • Error: The MERGE statement attempted to UPDATE or DELETE the same row more than once.
  • { "errorCode": "2200", "message": "ErrorCode=SqlOperationFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=A database operation failed with the following error: 'The MERGE statement attempted to UPDATE or DELETE the same row more than once. This happens when a target row matches more than one source row. A MERGE statement cannot UPDATE/DELETE the same row of the target table multiple times. Refine the ON clause to ensure a target row matches at most one source row, or use the GROUP BY clause to group the source rows.',Source=,''Type=System.Data.SqlClient.SqlException,Message=The MERGE statement attempted to UPDATE or DELETE the same row more than once. This happens when a target row matches more than one source row. A MERGE statement cannot UPDATE/DELETE the same row of the target table multiple times. Refine the ON clause to ensure a target row matches at most one source row, or use the GROUP BY clause to group the source rows.,Source=.Net SqlClient Data Provider,SqlErrorNumber=8672,Class=16,ErrorCode=-2146232060,State=1,Errors=[{Class=16,Number=8672,State=1,Message=The MERGE statement attempted to UPDATE or DELETE the same row more than once. This happens when a target row matches more than one source row. A MERGE statement cannot UPDATE/DELETE the same row of the target table multiple times. Refine the ON clause to ensure a target row matches at most one source row, or use the GROUP BY clause to group the source rows.,},],'", "failureType": "UserError", "target": "Copy data to Azure SQL", "details": [] }
Solution: On analyzing the issue, we found that our SQL table had a NOT NULL column that we no longer use in our database. After deleting this column from the SQL table, the pipeline ran successfully.
  • If we still needed this column, we would have checked whether the source sends data for it consistently. If the data was not regular (some rows arrive with nulls), we would have edited the table schema to make the column nullable.
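
For reference, the sink stored procedure in this pattern looks roughly like the sketch below. This is a minimal sketch with placeholder object names (dbo.Customer, dbo.CustomerType, LegacyCode), not our exact objects; it shows both the dedup-before-MERGE guard that error 8672 asks for and the ALTER COLUMN change for the nullable-column fix.

-- Minimal sketch only: table, type, and column names below are placeholders.
CREATE TYPE dbo.CustomerType AS TABLE
(
    CustomerId   INT           NOT NULL,
    CustomerName NVARCHAR(200) NULL
);
GO

CREATE OR ALTER PROCEDURE dbo.usp_UpsertCustomer
    @Customer dbo.CustomerType READONLY
AS
BEGIN
    SET NOCOUNT ON;

    -- Keep one source row per key so a target row can never match twice (error 8672).
    WITH src AS
    (
        SELECT CustomerId, CustomerName,
               ROW_NUMBER() OVER (PARTITION BY CustomerId ORDER BY CustomerId) AS rn
        FROM @Customer
    )
    MERGE dbo.Customer AS tgt
    USING (SELECT CustomerId, CustomerName FROM src WHERE rn = 1) AS s
        ON tgt.CustomerId = s.CustomerId
    WHEN MATCHED THEN
        UPDATE SET tgt.CustomerName = s.CustomerName
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (CustomerId, CustomerName) VALUES (s.CustomerId, s.CustomerName);
END
GO

-- If the unused NOT NULL column must stay, making it nullable is the other option:
ALTER TABLE dbo.Customer ALTER COLUMN LegacyCode NVARCHAR(50) NULL;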

Prevent Azure data factory from running multiple instances of a single pipeline at the same time

Today we will learn how to prevent Azure Data Factory from starting another instance of a pipeline while the previous run is still in progress.
Scenario: You have a pipeline that is set to run every 5 minutes. Suddenly, you notice that each run has started to take longer than 5 minutes, yet the next instance still starts 5 minutes later, and the overlapping runs are causing problems. Follow the steps below:

Steps:
  • Go to your pipeline
    • Click the grey area
  • Go to General > Concurrency
    • Set to 1
    • This sets the number of simultaneous runs of that pipeline to 1
    • Now only one instance of the pipeline can run at a time; any other runs triggered in the meantime are put into a waiting state

Azure data factory - datetime error in copy activity when loading data into a SQL database

Today we will learn how to handle datetime fields when loading data into a database using Data Factory.
Scenario: Suppose you have a datetime field in the source that you are trying to load into the database through Data Factory v2. Sometimes you get an error that datetime is converted to datetime2, or the datetime column is simply not loaded into the database. Follow the steps below:

Steps:

  • In the copy activity, change the mapping of the source column to String.

  • Use the stored procedure approach to load the data into the SQL database (a sketch is shown after these steps)
  • In the Sink section of the copy activity, specify the stored procedure name, the table type, and the table type parameter name
Alternate solution:
  • Try changing the mapping of the source column to DateTime
    • Specify the date format "yyyy-MM-dd"
  • Run the pipeline
  • The date will be converted into the specified format
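
A minimal sketch of the stored procedure approach is below. The object names (dbo.SalesType, dbo.Sales, SaleDate) are placeholders: the copy activity hands the date over as a string, and the procedure converts it to datetime2 on insert, so the copy activity never has to reconcile datetime and datetime2.

-- Sketch only: object names are placeholders; the date travels as a string.
CREATE TYPE dbo.SalesType AS TABLE
(
    SaleId   INT           NOT NULL,
    SaleDate NVARCHAR(30)  NULL,
    Amount   DECIMAL(18,2) NULL
);
GO

CREATE OR ALTER PROCEDURE dbo.usp_LoadSales
    @Sales dbo.SalesType READONLY
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO dbo.Sales (SaleId, SaleDate, Amount)
    SELECT SaleId,
           -- TRY_CONVERT returns NULL instead of failing on a malformed value;
           -- 'yyyy-MM-dd' and 'yyyy-MM-dd HH:mm:ss' strings both parse with style 120.
           TRY_CONVERT(DATETIME2, SaleDate, 120),
           Amount
    FROM @Sales;
END
GO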



Data Factory - exception handling for files with no data rows but headers only

Today we will learn how to apply exception handling when loading CSV files that contain only a header line and no data rows into an Azure SQL database.
Scenario: Suppose you want to load data into Azure SQL incrementally from CSV files that are generated every N minutes into Azure Data Lake Storage (ADLS). You want to apply exception handling so that Data Factory does not fail when an empty file comes in.

Solution: Make sure the source generates completely empty files when there is no data (i.e., without even a header row). Data Factory succeeds out of the box when a file is completely blank, but it fails when the file contains only a header row, so tweak your source logic accordingly.


Steps after the source file is generated:
  • Create a Get Metadata activity in Data Factory
    • In the Field list, specify the "Exists" argument
    • We will use this argument in the If Condition to check whether the file exists in the folder.
      • This prevents the "path doesn't exist / invalid source" error

  • Drag "If Condition" task and connect it to the Get Metadata task (as shown above in the first picture)
    • Specify an expression to check if the file exists (dataset path is valid)
    • Expression: @and(bool(activity('check if file exists').output.exists),bool(true))

    • In the True condition, insert your incremental load logic:
      • a Copy activity that copies data from source to target
  • You are done.
  • Run your pipeline.
  • Here, after successfully loading the data into SQL, I am moving the file to an archive ADLS container.




Note: The image above shows that both blank files (0 bytes) and files with data rows are copied to the Archive folder after a successful load into SQL.
  • For a blank file without a header row, Data Factory succeeds and proceeds to the next activity. No data is written to the database and the pipeline completes successfully.


Azure DevOps - Connect Data Factory to Azure DevOps Git repository

Today we will learn how to set up a code repository in Data Factory and link it with Azure DevOps Git.
There are two repository types available in Data Factory:
  1. GitHub
  2. Azure DevOps Git
Today, we will be using Azure DevOps Git as the code repository in Data Factory.

To link Azure DevOps Git to Data Factory, you first have to set up Azure DevOps:

  1. Go to Azure DevOps (https://aex.dev.azure.com/me?mkt=en-US)
  2. Create an organization (if you have not created one already)
  3. Add a project in that organization
  4. Assign team members (if any) and give them the appropriate permissions:
    1. For team members to create Git repos, they need project administrator rights
    2. If you want them to only contribute to an existing project/Git repo, but not be an admin, assign them contributor rights.
    3. Now, go inside the project and create a test Git repo

Once you have created an organization, a project, and a repo inside the project, you are ready to link the Azure DevOps code repository to Data Factory.

Your Azure DevOps structure should look like the one below:

Let's link the code repository to Data Factory.

Go to your data factory and click "Author and Monitor" in the Overview section.
  • It will open a new browser tab and take you inside the Data Factory instance.
  • In the Data Factory Overview section, click "Setup code repository"
  • Select the options below:
    • Azure DevOps Git as the repository type
    • Your Azure DevOps account (organization)
    • Your project (created within the organization)
    • Your Git repo (created inside the project)
    • Collaboration branch: you can choose master.
      • Once you publish in Data Factory, it will create a new branch "adf_publish" where it stores its ARM deployment templates.
  • Click Apply.
  • You are done.
Go inside the data factory by clicking "Author" and you will see your code repository linked.
Enjoy :)

Azure Data Factory: Upsert using Data Flows

Today we will learn how to perform an upsert in Azure Data Factory (ADF) using data flows

Scenario:
  • We will be ingesting a CSV stored in Azure Storage (ADLS Gen2) into Azure SQL using the upsert method
Steps:

1. Create two connections (linked services) in ADF:
       1. One for the CSV stored in ADLS
       2. One for the target Azure SQL

                                                 


Azure Data Factory - Upsert using Pipeline approach instead of data flows

Today we will learn how to perform an upsert in Azure Data Factory (ADF) using the pipeline approach instead of data flows

Task:
  • We will be loading data from a CSV (stored in ADLS Gen2) into Azure SQL with an upsert using Azure Data Factory. Once the data load is finished, we will move the file to an Archive directory and append a timestamp to the file name that denotes when the file was loaded into the database
Benefits of using Pipeline:
  • As you know, triggering a data flow adds cluster start-up time (~5 minutes) to your job execution time. You can use the Time to Live (TTL) setting on your Azure integration runtime (IR) to reduce the cluster start-up time, but a cluster may still take around 2 minutes to start a Spark context. So, if you are thinking of building a near-real-time data load process, the pipeline approach works best, as it does not need a cluster and can execute in seconds.
Pre-requisites
  • An Azure Data Factory resource
  • An Azure Storage account (Gen 2)
  • An Azure SQL Database
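
Before building the pipeline itself, the Azure SQL side needs a table type and an upsert stored procedure for the copy activity sink to call (stored procedure name, table type, and table type parameter, as in the datetime post above). A minimal sketch with placeholder object names (dbo.Product, dbo.ProductType, usp_UpsertProduct) is below; it uses UPDATE-then-INSERT instead of MERGE, which also sidesteps the MERGE error 8672 discussed earlier.

-- Sketch only: object names are placeholders for your own source/target schema.
CREATE TYPE dbo.ProductType AS TABLE
(
    ProductId   INT           NOT NULL,
    ProductName NVARCHAR(200) NULL,
    Price       DECIMAL(18,2) NULL
);
GO

CREATE OR ALTER PROCEDURE dbo.usp_UpsertProduct
    @Product dbo.ProductType READONLY
AS
BEGIN
    SET NOCOUNT ON;

    -- Update rows that already exist in the target...
    UPDATE tgt
    SET    tgt.ProductName = src.ProductName,
           tgt.Price       = src.Price
    FROM   dbo.Product AS tgt
    INNER JOIN @Product AS src
            ON src.ProductId = tgt.ProductId;

    -- ...then insert the rows that do not exist yet.
    INSERT INTO dbo.Product (ProductId, ProductName, Price)
    SELECT src.ProductId, src.ProductName, src.Price
    FROM   @Product AS src
    WHERE  NOT EXISTS (SELECT 1 FROM dbo.Product AS tgt
                       WHERE tgt.ProductId = src.ProductId);
END
GO

The UPDATE runs before the INSERT so newly inserted rows are not immediately re-updated in the same call.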