Azure Data Factory Linked Services Parameterization
Linked services (LS) hold the connection information, such as connection strings, needed to reach external resources. In Azure Data Factory (ADF) there are currently close to 100 different LS types.
These connections can be parameterized and the values passed at runtime. For some LS, the parameterization can be done directly in the ADF UI. For the remaining ones, it can be achieved by editing the LS JSON definition.
In this blog post, we’ll go through both options.
Table of Contents
- Prerequisites
- Why parameterize a Linked Service?
- How does it work?
- Building the Demo Scenario
- Executing and Testing the Demo Scenario
- Wrapping Up
Prerequisites
If you want to follow the examples described in this post, I suggest you follow the instructions available here, which will deploy all the objects required for this blog post. All you need is an Azure subscription.
If you don’t want to use that option, make sure you have the following items available:
- An Azure subscription. You may register for a free Azure subscription if you don’t have one yet
- An Azure Resource Group where you will place the resources created. Please refer to Create a Resource Group
- An Azure Data Factory instance. Please refer to Create a Data Factory to see how to create a new instance using the portal
- Two Azure SQL Servers, each with a basic database (cheapest option). Please refer to Create a Single Database. Also, make sure to allow all Azure services in the server-level firewall rules, as well as your own IP as seen here
- Two Azure Data Lake Storage Gen2 accounts, each with a container created. Please refer to Create a Data Lake Storage Account
- Upload the FootballClubs.csv sample file to the container created in the previous step, in one of the storage accounts (this will be the source account)
- Create the dbo.FootballClubs table in both databases and insert the sample records into just one of them (the source database)
- Assign the role Storage Blob Data Contributor on the 2 storage accounts to the Azure Data Factory managed identity. Please refer to Assign Azure roles
Why parameterize a Linked Service?
If you use ADF to orchestrate the movement of data between several locations, it’s quite easy to start accumulating a lot of different LS. You can easily end up with 4 or 5 Azure SQL Databases, 4 or 5 storage accounts, 4 or 5 REST APIs, and so on. This makes the management of an ADF instance harder, not only from a development perspective but also when it comes to migrating objects to other environments.
This is where the parameterization of LS pays off. We can create a single LS per connection type and, by making use of parameters, define at runtime which location we’re trying to reach. These parameters can easily be defined and controlled through metadata and incorporated into the data flow process.
How does it work?
Defining parameters in our LS to make them dynamic means that all downstream components (datasets, activities and pipelines) inherit that dynamic behaviour. It is through a set of parameter values passed between these components that we define what is executed at runtime.
Looking at the image above, and assuming a Data Lake Storage Gen2 example, we have the following:
- A parameter LS_AccountURL defined at the LS level, which is assigned to the URL attribute of the LS
- A parameter DS_AccountURL defined at the dataset level, which is assigned to the dataset's LS property LS_AccountURL
- Two parameters, PL_SourceAccountURL and PL_TargetAccountURL, defined at the pipeline level, which are used in a copy activity to set the dataset property DS_AccountURL
When we execute the pipeline, we assign values to the 2 pipeline parameters. These values are passed to the copy activity (and to any other activity that uses them), then to the dataset and, finally, to the LS. This allows us to reach 2 different storage accounts while having only 1 LS.
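To make the chain concrete, here is a rough sketch of how each hop can be expressed in JSON, using the parameter names from the diagram together with the LS and dataset names we'll create later in this post. These are illustrative fragments rather than complete object definitions, and the JSON generated by the UI may differ slightly.
The LS references its own parameter in the url property:
"typeProperties": {
    "url": "@{linkedService().LS_AccountURL}"
}
The dataset feeds that LS parameter from one of its own parameters:
"linkedServiceName": {
    "referenceName": "LS_DataLakeStorage",
    "type": "LinkedServiceReference",
    "parameters": {
        "LS_AccountURL": "@dataset().DS_AccountURL"
    }
}
And the copy activity feeds the dataset parameter from a pipeline parameter:
"inputs": [
    {
        "referenceName": "GenericDelimitedFile",
        "type": "DatasetReference",
        "parameters": {
            "DS_AccountURL": "@pipeline().parameters.PL_SourceAccountURL"
        }
    }
]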
Building the Demo Scenario
The examples we’re going to build are simple: we’ll move some data between 2 Azure SQL Databases and copy a .csv file from one storage account to another.
We’ll parameterize the Azure SQL Database LS using the ADF UI and parameterize the Storage Account LS by manipulating the JSON directly. After that, we’ll create 2 datasets (one for each type of linked service) and 2 pipelines to move the data.
Parameterize the Azure SQL Database Linked Service
If you used the GitHub content to deploy the objects required to follow this blog post, you’ll have two LS already created: one for an Azure SQL Database and another one for an Azure Data Lake Storage Gen2 (if you didn’t, you’ll need to create them).
- We’ll start by editing the Azure SQL Database LS. Click on the object to open it.
- Create 4 parameters named ServerName, DatabaseName, Username and Password. After that, fill in the details of the LS as seen in the image below and click on the Apply button.
- Fully qualified domain name - @{linkedService().ServerName}
- Database name - @{linkedService().DatabaseName}
- Authentication type - leave default (SQL Authentication)
- User name - @{linkedService().Username}
- Password - @{linkedService().Password}
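Although the UI takes care of everything here, it helps to know roughly what the resulting JSON looks like. The sketch below shows one possible shape, using a connection-string-based definition; the exact properties generated by the UI depend on the connector version, but the key part is the @{linkedService().<ParameterName>} expressions. In a real scenario you would typically keep the password in Azure Key Vault rather than passing it around as a plain parameter.
{
    "name": "LS_SQLServer",
    "properties": {
        "type": "AzureSqlDatabase",
        "parameters": {
            "ServerName": { "type": "string" },
            "DatabaseName": { "type": "string" },
            "Username": { "type": "string" },
            "Password": { "type": "string" }
        },
        "annotations": [],
        "typeProperties": {
            "connectionString": "Server=tcp:@{linkedService().ServerName},1433;Database=@{linkedService().DatabaseName};User ID=@{linkedService().Username};Password=@{linkedService().Password};Encrypt=True;Connection Timeout=30"
        }
    }
}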
Parameterize the Azure Data Lake Storage Linked Service
We’re going to edit this specific LS by directly manipulating the JSON that defines the object.
- Click on the Code button to see the JSON definition of the object.
{
"name": "LS_DataLakeStorage",
"type": "Microsoft.DataFactory/factories/linkedservices",
"properties": {
"type": "AzureBlobFS",
"typeProperties": {
"url": "https://youraccountnamehere.dfs.core.windows.net"
},
"annotations": []
}
}
- To parameterize this LS, we’ll need to add a parameter, along with its data type, to the JSON definition and replace the url value with an expression referencing that parameter. The definition will look like the one below. Click on the Apply button.
{
"name": "LS_DataLakeStorage",
"properties": {
"parameters": {
"AccountUrl": {
"type": "string"
}
},
"annotations": [],
"type": "AzureBlobFS",
"typeProperties": {
"url": "@{linkedService().AccountUrl}"
}
},
"type": "Microsoft.DataFactory/factories/linkedservices"
}
Create an Azure SQL Database Dataset
- Go to the Author page and select the option to create a New dataset. Once the window opens, choose the Azure SQL Database option and click on Continue.
- On the next screen, name the dataset GenericSQLTable and choose the LS already created, LS_SQLServer. Click on the OK button.
- Once you click on the OK button, a new tab will open with the dataset just created. The first thing you’ll notice is that the dataset has 4 LS properties, corresponding to the 4 parameters we defined on the LS. Choose the Parameters tab and create 5 parameters:
- LS_ServerName
- LS_DatabaseName
- LS_Username
- LS_Password
- TableName
- Go back to the Connection tab and input the 5 parameters we just defined as seen below.
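For reference, the dataset JSON should end up looking roughly like the sketch below. The exact typeProperties (for example tableName versus separate schema and table properties) depend on the connector version, so treat this as a sketch rather than the exact output of the UI.
{
    "name": "GenericSQLTable",
    "properties": {
        "linkedServiceName": {
            "referenceName": "LS_SQLServer",
            "type": "LinkedServiceReference",
            "parameters": {
                "ServerName": "@dataset().LS_ServerName",
                "DatabaseName": "@dataset().LS_DatabaseName",
                "Username": "@dataset().LS_Username",
                "Password": "@dataset().LS_Password"
            }
        },
        "parameters": {
            "LS_ServerName": { "type": "string" },
            "LS_DatabaseName": { "type": "string" },
            "LS_Username": { "type": "string" },
            "LS_Password": { "type": "string" },
            "TableName": { "type": "string" }
        },
        "annotations": [],
        "type": "AzureSqlTable",
        "typeProperties": {
            "tableName": {
                "value": "@dataset().TableName",
                "type": "Expression"
            }
        }
    }
}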
Create an Azure Data Lake Storage Gen2 Dataset
- Go to the Author page and select the option to create a New dataset. Once the window opens, choose the Azure Data Lake Storage Gen2 option and click on Continue.
- On the next screen, select the option Delimited Text and click on the Continue button.
- On the next screen, name the dataset GenericDelimitedFile and choose the LS already created, LS_DataLakeStorage. Click on the OK button.
- As before, a new tab will open with the dataset just created, this time with 1 LS property. Choose the Parameters tab and create 3 parameters:
- LS_AccountUrl
- ContainerName
- FileName
- Go back to the Connection tab and input the 3 parameters we just defined as seen below. Leave all other options as default.
- Click the Publish button to save the 2 datasets just created. When a new window opens on the right-hand side, click the Publish button again.
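For reference, the GenericDelimitedFile dataset JSON should look roughly like the sketch below (keeping the default delimited text options); again, treat it as a sketch rather than the exact output of the UI.
{
    "name": "GenericDelimitedFile",
    "properties": {
        "linkedServiceName": {
            "referenceName": "LS_DataLakeStorage",
            "type": "LinkedServiceReference",
            "parameters": {
                "AccountUrl": "@dataset().LS_AccountUrl"
            }
        },
        "parameters": {
            "LS_AccountUrl": { "type": "string" },
            "ContainerName": { "type": "string" },
            "FileName": { "type": "string" }
        },
        "annotations": [],
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": {
                    "value": "@dataset().ContainerName",
                    "type": "Expression"
                },
                "fileName": {
                    "value": "@dataset().FileName",
                    "type": "Expression"
                }
            },
            "columnDelimiter": ","
        }
    }
}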
Create Pipeline to Copy SQL Database Data
- Go to the Author page and select the option to create a New pipeline. A new tab will open with the new pipeline created.
- Add a Copy data activity to the pipeline.
- On the Parameters tab of the pipeline, define the following 10 parameters:
- SourceServer
- SourceDatabase
- SourceUsername
- SourcePassword
- SourceTable
- TargetServer
- TargetDatabase
- TargetUsername
- TargetPassword
- TargetTable
- Click on the Copy data activity in the pipeline and, in the Source tab of the activity, select GenericSQLTable as the source dataset. Map the Source* parameters we just defined to the Dataset properties (see image below).
- In the Sink tab of the Copy data activity, select GenericSQLTable as the sink dataset and map the Target* parameters to the Dataset properties (see image below).
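For reference, the copy activity in this pipeline ends up looking roughly like the sketch below. Only the parts relevant to parameter passing are shown, the activity name is illustrative, and the UI may wrap each expression in a more verbose value/Expression object.
{
    "name": "CopySqlData",
    "type": "Copy",
    "typeProperties": {
        "source": { "type": "AzureSqlSource" },
        "sink": { "type": "AzureSqlSink" }
    },
    "inputs": [
        {
            "referenceName": "GenericSQLTable",
            "type": "DatasetReference",
            "parameters": {
                "LS_ServerName": "@pipeline().parameters.SourceServer",
                "LS_DatabaseName": "@pipeline().parameters.SourceDatabase",
                "LS_Username": "@pipeline().parameters.SourceUsername",
                "LS_Password": "@pipeline().parameters.SourcePassword",
                "TableName": "@pipeline().parameters.SourceTable"
            }
        }
    ],
    "outputs": [
        {
            "referenceName": "GenericSQLTable",
            "type": "DatasetReference",
            "parameters": {
                "LS_ServerName": "@pipeline().parameters.TargetServer",
                "LS_DatabaseName": "@pipeline().parameters.TargetDatabase",
                "LS_Username": "@pipeline().parameters.TargetUsername",
                "LS_Password": "@pipeline().parameters.TargetPassword",
                "TableName": "@pipeline().parameters.TargetTable"
            }
        }
    ]
}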
Create Pipeline to Copy Storage Account Blob
As in the previous section, create a new pipeline and add a Copy data activity to it.
On the Parameters tab of the pipeline, define the following 6 parameters:
- SourceAccountUrl
- SourceContainer
- SourceFileName
- TargetAccountUrl
- TargetContainer
- TargetFileName
- Click on the Copy data activity in the pipeline and, in the Source tab of the activity, select GenericDelimitedFile as the source dataset. Map the Source* parameters we just defined to the Dataset properties (see image below).
- In the Sink tab of the Copy data activity, select GenericDelimitedFile as the sink dataset and map the Target* parameters to the Dataset properties (see image below; a JSON sketch of this mapping follows the list).
- Click the Publish button to save the 2 pipelines just created. When a new window opens on the right-hand side, click the Publish button again.
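As with the first pipeline, the source and sink dataset references of this copy activity simply map the pipeline parameters to the dataset parameters, roughly as sketched below (only the inputs/outputs fragment is shown):
"inputs": [
    {
        "referenceName": "GenericDelimitedFile",
        "type": "DatasetReference",
        "parameters": {
            "LS_AccountUrl": "@pipeline().parameters.SourceAccountUrl",
            "ContainerName": "@pipeline().parameters.SourceContainer",
            "FileName": "@pipeline().parameters.SourceFileName"
        }
    }
],
"outputs": [
    {
        "referenceName": "GenericDelimitedFile",
        "type": "DatasetReference",
        "parameters": {
            "LS_AccountUrl": "@pipeline().parameters.TargetAccountUrl",
            "ContainerName": "@pipeline().parameters.TargetContainer",
            "FileName": "@pipeline().parameters.TargetFileName"
        }
    }
]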
Executing and Testing the Demo Scenario
We’re now going to execute the 2 pipelines created. The first will copy records from a table in our source database to our target database. The second will copy a blob from our source storage account to our target storage account.
First, let’s check our database tables to make sure we only have records in our source table. The easiest way to do that is to go to the Azure portal, select the database resource and choose the option Query editor (preview). You’ll see a SQL server authentication area where you need to enter the password defined when deploying the resources.
After logging in, execute the following query:
SELECT * FROM dbo.FootballClubs
In the source database, you should see 9 records being returned.
In the target database, there won’t be any records.
To check our storage accounts, the best option is again to use the portal, select the storage resource and choose the option Containers. You should then see a container named sourcecontainer or targetcontainer depending on which storage account you’re looking at.
Within the container sourcecontainer, you should see the file FootballClubs.csv.
The container targetcontainer should be empty.
Copy Azure SQL Database data
- Back in ADF, open pipeline1 and choose the option Trigger now.
- A window will open on the right-hand side asking for values of all the parameters we defined in the pipeline. Input the following values and click on the OK button.
- SourceServer - fully qualified name of the server such as sourcesqlsrvXXXXXX.database.windows.net
- SourceDatabase - SourceDB
- SourceUsername - saXXXXXX (name of the server admin that can be seen on the portal’s server page)
- SourcePassword - password you defined when deploying the resources
- SourceTable - FootballClubs
- TargetServer - fully qualified name of the server such as targetsqlsrvXXXXXX.database.windows.net
- TargetDatabase - TargetDB
- TargetUsername - saXXXXXX (name of the server admin that can be seen on the portal’s server page)
- TargetPassword - password you defined when deploying the resources
- TargetTable - FootballClubs
- Since the pipeline is now running, go to the Monitor page; after a few seconds you should see a Succeeded status for the pipeline. Click on the pipeline name.
- Once you click on the pipeline name, you’ll be taken to a page where you can see that the copy activity completed successfully, along with the details of the copy activity.
- We can see in the copy activity details that 9 rows were copied. To confirm, go back to the target database in the portal and run the same query we executed before running the pipeline. You should now get 9 rows back.
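As a side note, if you prefer to trigger this run programmatically instead of through the UI (for example via the ADF REST API's pipeline Create Run call or the Azure SDKs), the same parameter values can be supplied as a plain JSON object. The values below are just the placeholders used throughout this post.
{
    "SourceServer": "sourcesqlsrvXXXXXX.database.windows.net",
    "SourceDatabase": "SourceDB",
    "SourceUsername": "saXXXXXX",
    "SourcePassword": "<password defined during deployment>",
    "SourceTable": "FootballClubs",
    "TargetServer": "targetsqlsrvXXXXXX.database.windows.net",
    "TargetDatabase": "TargetDB",
    "TargetUsername": "saXXXXXX",
    "TargetPassword": "<password defined during deployment>",
    "TargetTable": "FootballClubs"
}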
Copy Storage Account Blob
Back in ADF, open pipeline2 and choose the option Trigger now.
A window will open on the right-hand side asking for values of all the parameters we defined in the pipeline. Input the following values and click on the OK button.
- SourceAccountUrl - fully qualified URL of the source storage account, such as https://stsourceXXXX.dfs.core.windows.net/
- SourceContainer - sourcecontainer
- SourceFileName - FootballClubs.csv
- TargetAccountUrl - fully qualified URL of the target storage account, such as https://sttargetXXXX.dfs.core.windows.net/
- TargetContainer - targetcontainer
- TargetFileName - FootballClubs.csv
- We go back to the Monitor page to see the status of the pipeline. Click on the pipeline name.
- We can see in the copy activity details that 1 file was copied. To make sure that happened, we go to our target storage account in the portal and check the contents of the container. There should now be one file named FootballClubs.csv.
Wrapping Up
In this blog post we saw how to make use of parameterized LS in ADF. The parameterization of LS gives a lot of flexibility when designing data flows and is an important building block for achieving a metadata-driven process.
Don’t forget to delete any resources you’ve created to follow this blog post if you no longer plan on using them.
Thanks for reading!