Creating a Custom Data Processing Schedule in Azure Data Factory
Scheduling is an integral part of data pipelines. We often receive requirements like "the pipeline needs to run every day at 10 AM UTC", "the data refresh has to be weekly, on Wednesday", or more complex schedules like "run specific pipelines twice every day, at 5 AM UTC and 12 AM UTC, but not on Fridays".
Scheduling requirements are driven by source data availability, by when the business is ready to consume and infer from the data, by ad-hoc data requirements, and so on.
When we work on Azure as our cloud platform, data pipelines are authored, orchestrated, and scheduled using Azure Data Factory.
What is Azure Data Factory (ADF)?
Azure Data Factory is a cloud-based ETL and data integration service that allows you to create data-driven workflows for orchestrating data movement and transforming data at scale. Using ADF, you can create and schedule data-driven workflows (called pipelines) that ingest data from disparate data stores.
Scheduling in Azure Data Factory
Let's focus on the scheduling aspect of ADF. To schedule a data pipeline in ADF, we use the concept of triggers.
Types of triggers supported in ADF:
- Time-based triggers
  - Scheduled: runs on a schedule defined by a start date, a specific time zone, a recurrence, etc.
  - Tumbling window: a series of fixed-sized, non-overlapping, and contiguous time intervals.
- Event-based triggers
  - Storage events: part of an event-driven architecture (EDA); the trigger fires on the arrival or deletion of files in Azure Storage.
  - Custom events: also part of an event-driven architecture (EDA); the trigger fires on Azure Event Grid events.
Custom Scheduling in ADF
Now that we have a fair understanding of how triggers work in ADF, let's say we have a scenario in which a specific data pipeline has to be triggered twice a month, but the days/weeks of the schedule differ from month to month.
Let's say the table below explains how we want to schedule our data pipelines.
We could implement the above use case with a schedule-based time trigger, but it would become cumbersome and hard to maintain as more complexity is added to the schedule.
Let's look at one simple way to implement this type of scheduling requirement.
Logical Steps for Implementing the Requirement
- Let's create a CALENDAR table (sample in the image below) capturing Date, Weekday, Month, Quarter, Week Number, etc. This type of DIMENSION table is commonly available, or we can easily derive one using DATE functions; see the sketch right after this bullet.
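For illustration, here is a minimal T-SQL sketch of such a Calendar table, assuming the column names used by the scheduling function further below ([DATEVALUE], [WEEKDAY], [WEEKNUMBER], [MONTHVALUE]); an existing date dimension in your environment may look different, and the week-of-month rule shown here is just one possible convention.

-- Minimal sketch of a Calendar dimension table (column names assumed to
-- match those referenced by the scheduling function shown later).
CREATE TABLE [dbo].[Calendar] (
    [DATEVALUE]  NVARCHAR(10) PRIMARY KEY, -- e.g. '20230221' (yyyyMMdd)
    [WEEKDAY]    VARCHAR(10),              -- e.g. 'TUESDAY'
    [WEEKNUMBER] VARCHAR(10),              -- week of the month, e.g. 'W3'
    [MONTHVALUE] VARCHAR(10),              -- e.g. 'FEBRUARY'
    [QUARTER]    VARCHAR(2)                -- e.g. 'Q1'
);

-- One way to derive a row for a given date using built-in DATE functions;
-- the week-of-month calculation below is one possible convention.
DECLARE @d DATE = '2023-02-21';
INSERT INTO [dbo].[Calendar]
VALUES (
    FORMAT(@d, 'yyyyMMdd'),
    UPPER(DATENAME(WEEKDAY, @d)),
    CONCAT('W', (DAY(@d) - 1) / 7 + 1),
    UPPER(DATENAME(MONTH, @d)),
    CONCAT('Q', DATEPART(QUARTER, @d))
);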
- Next, let's create a simple user-defined table-valued function (a function that returns a table as its output) in Azure SQL.
- This function takes a date as its input parameter and returns a table with columns for pipeline name and type of data load (sample function code below). The logic can be tailored to specific requirements.
CREATE OR ALTER FUNCTION [dbo].[execution_schedule] (@current_date NVARCHAR(10))
RETURNS @Pipelines TABLE (
    pipelineName VARCHAR(50),
    loadType     VARCHAR(10)
)
AS
BEGIN
    DECLARE @weekday    VARCHAR(10);
    DECLARE @weeknumber VARCHAR(10);
    DECLARE @monthvalue VARCHAR(10);

    -- Look up the calendar attributes for the supplied date
    SELECT
        @weekday    = [WEEKDAY],
        @weeknumber = [WEEKNUMBER],
        @monthvalue = [MONTHVALUE]
    FROM [dbo].[Calendar]
    WHERE [DATEVALUE] = @current_date;

    -- pipe1: full load on W1 Tuesday or W2 Friday, in every month except December
    IF @monthvalue != 'DECEMBER' AND ((@weeknumber = 'W1' AND @weekday = 'TUESDAY') OR (@weeknumber = 'W2' AND @weekday = 'FRIDAY'))
        INSERT INTO @Pipelines VALUES ('pipe1', 'FULL_LOAD');
    -- pipe1: delta load on W1 Friday or W2 Thursday, in December only
    ELSE IF @monthvalue = 'DECEMBER' AND ((@weeknumber = 'W1' AND @weekday = 'FRIDAY') OR (@weeknumber = 'W2' AND @weekday = 'THURSDAY'))
        INSERT INTO @Pipelines VALUES ('pipe1', 'DELTA_LOAD');

    -- pipe2: full load on W1 Tuesday of every month
    IF @weeknumber = 'W1' AND @weekday = 'TUESDAY'
        INSERT INTO @Pipelines VALUES ('pipe2', 'FULL_LOAD');

    RETURN;
END;
GO

- In ADF, let's create a pipeline and add a LOOKUP activity. In the activity's Settings, use the function as the query, e.g. SELECT * FROM [dbo].[execution_schedule]('20230221'). Once the pipeline runs on its schedule, the Lookup returns a table, as discussed above, with details about which pipeline to run and the data load type.
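As a small usage sketch, the Lookup query could also resolve the run date on the SQL side instead of hard-coding it; the yyyyMMdd format is an assumption based on the sample value above, and the hard-coded variant shown earlier works just as well.

-- Possible Lookup query: derive today's date (UTC) on the SQL side and pass
-- it to the table-valued function; the yyyyMMdd format is assumed from the
-- sample value used above.
SELECT pipelineName, loadType
FROM [dbo].[execution_schedule](FORMAT(GETUTCDATE(), 'yyyyMMdd'));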
The returned table is then looped over using a FOREACH activity, as shown below.
- Inside the FOREACH activity, we add a SWITCH activity that defines the cases under which the respective pipelines are executed.
Conclusion
The explanation above shows how we can combine ADF with Azure SQL to implement complex scheduling by hiding the complexity in SQL and keeping the ADF pipelines lean and reusable. We can extend this approach by storing all the parameters relevant to a pipeline execution as a JSON string in a config table in Azure SQL; a minimal sketch of such a table follows.
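As an illustration of that extension, here is a minimal sketch of a hypothetical config table that stores per-pipeline parameters as a JSON string; the table name, column names, and JSON keys are assumptions, not part of the solution described above.

-- Hypothetical config table: one row per pipeline, with execution parameters
-- kept as a JSON string (all names and keys here are illustrative only).
CREATE TABLE [dbo].[PipelineConfig] (
    pipelineName VARCHAR(50) PRIMARY KEY,
    parameters   NVARCHAR(MAX)  -- JSON string with source/target settings, etc.
);

INSERT INTO [dbo].[PipelineConfig]
VALUES ('pipe1', N'{"sourceContainer": "raw", "targetTable": "dbo.Sales", "retryCount": 3}');

-- A second Lookup (or a join inside execution_schedule) could then fetch the
-- parameters for the pipeline selected by the Switch activity.
SELECT parameters
FROM [dbo].[PipelineConfig]
WHERE pipelineName = 'pipe1';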
While designing a solution, a fundamental principle to aim for is that we should not try to fit everything into a single resource, framework, or toolset. The power of working in the cloud is the availability of a plethora of resources, so solutions should be designed with integration across different resources, frameworks, and toolsets in mind.