As a freelance SQL specialist I tend to see the following recurring SSIS problems:
- Sprawling monolithic packages that implement Rube Goldberg / Heath Robinson style solutions.
- No consistent approaches when implementing packages.
- Packages that provide no instrumentation as to what is happening in the greater ETL process, in the worse cases this often results in developers having to fire up BIDS / SSDT in order to open up a package and run it in debug mode, in the worst case I’ve seen this happen on production environments.
- Package that are not conducive to reuse.
The solution to this is something called an “ETL framework”, in this post I will outline my SSIS ETL framework. There are a variety of ways in which an ETL framework can be obtained or developed:
- The purchasing of a commercial framework, such as BI Express from Pragmatic works.
- A free framework such as the BI Monkey framework available on codeplex.
- The “Roll your own” approach.
- A hybrid approach, this is to obtain a free framework and then with the permission of its originator extend it to meet your specific requirements.
Regardless of how you obtain a framework, they all tend to have several things in common:
- A metadata driven ‘Orchestration’ package which is essentially the run time persona of the framework. This is one “One package to rule them all” under which all custom ETL packages should run.
- Metadata to define what packages make up an ETL job, what packages should run in parallel, package execution precedence.
- A database to store the metadata.
- A logging database to record all job activity, it’s a common practise to use this as a foundation for writing reports and dash boards on top of.
- A child package template which all ETL job packages are based on.
What I will outline covers a basic framework that provides a job / package logging infrastructure and metadata based package configuration and orchestration. More advanced frameworks will leverage the use of metadata driven configuration to a much greater extent. There are people in the greater SQL community such as Andy Leonard and Matt Masson who are proponents of a “Pattern based” approach to using SSIS, my recommendation encompasses this mind-set. If starting from scratch, follow a high level road map for implementing a framework.
The one thing I would advise against is the development of an all singing and all dancing framework from the outset, complexity should only be introduced into the framework on an as and when required basis. The approach I do recommend is:
Start off with a basic framework which incorporates metadata data driven child package orchestration, common logging and a basic child package template.
- As a pattern a pattern emerges as to what the common ETL activities are, develop a standard set of child package template that encapsulate these, such activities might typically include: incremental loads, the staging of files, the archival of staged files, maintenance of dimension tables etc.
- Seek opportunities to take the leg work out of implementing new ETL processes by making the framework increasingly metadata driven.
A good framework should abstract all the plumbing style activity away from the ETL developer and be opaque to the child packages that run within it. SQL Server 2012 furnishes a number of framework like capabilities:
- A standard way of dealing with package configuration via parameter-isation and build configurations.
- A catalogue which records package execution statistics and comes with a number of canned reports.
- A project based deployment model which integrates with the visual studio project model much better than the disjointed way of managing and deploying packages prior to SQL Server 2012.
Given the opportunity to do so, I would always recommend the use of these features in SQL Server 2012+.
Lets have a look at my framework, to begin with this is a database diagram for my metadata management database:
The most basic metadata schema that can support a framework consists of a job table and a package table. The JobPackageSchedule table allows packages to be used by more than one job, thus promoting package reuse. The PrecenedenceConstraint table facilitates some control over whether the next package in the schedule is executed pending other package(s) associated with the same execution id having:
- Completed successfully
- Just completed irrespective of whether it failed or no
and / or
- Has loaded data, i.e. has found that the source(s) contained data to load
At the time of writing this, I have only included the ability for child packages to be configured to run if previous packages in the execution of the job have loaded data. In the interests of performance and throughput the framework allows up to eight packages to executed in parallel, this is why the JobPackageSchedule table has the columns PackageId_Channel1, PackageId_Channel2 and so forth. Additionally the framework enables simple metadata controlled parameterisation, by each ‘Channel’ having the potential for an Int or Varchar value to be assigned to it. Whilst being good for throughput, parallel package execution has an added complexity overhead when it comes to recovery from failures. For this reason I advocate that only packages associated with the staging of data are executed in parallel. Assuming that no historical data is present in the staging area, when a job is re-run, the staging packages can truncate whatever has been left in the staging tables from previous executions, irrespective of whether they have been successful or not. To complete the circle, I recommend that the final delivery of data be performed within the same transaction, my preference being to do this within a stored procedure. This whole approach enables roll forward job recovery.
Below is the orchestration package, all packages that run under the control of the framework execute as ‘Child’ packages of this one package:
The image above is intended to help understand the overall shape of the orchestration package. A look at how an agent job is configured to run packages via the framework will help explain how the orchestration package acts as the entry point for the running of all jobs:
Because these screen shots come from an environment in use from one of my consulting engagements, I have obscured any sensitive information. On the package tab a reference is made to the EtlFrameworkRuntime.dtsx package, and on the configuration tab we refer to the name of the job we wish to run ( again, sensitive information is obscured out ):
In this example, the agent job is configured to run the “JDE Import” job.
The orchestration package has two distinct parts which are better explained if we zoom into the SSDT design canvas view of the package control flow:
There first part in the top half of the view checks that the job:
Exists in the metadata
Is not already running
Finally an execution id is obtained for the purpose of ‘Tagging’ all activity relating to the current execution of the job. In the SQL_GetJobPackages container the list of packages to run, package run order and parallel execution requirements is gathered. The results returned by the SQL_GetJobPackages ( as per the query below ) task are stored as a result set in an ADO.NET object.
SELECT chan1.Name AS PackageName_Channel1 ,chan1.Id AS PackageId_Channel1 ,CASE chan2.Enabled WHEN 1 THEN ISNULL(chan2.Name, 'Disable Channel') ELSE 'Disable Channel' END AS PackageName_Channel2 ,CASE chan2.Enabled WHEN 1 THEN chan2.Id ELSE 0 END AS PackageId_Channel2 ,CASE chan3.Enabled WHEN 1 THEN ISNULL(chan3.Name, 'Disable Channel') ELSE 'Disable Channel' END AS PackageName_Channel3 ,CASE chan3.Enabled WHEN 1 THEN chan3.Id ELSE 0 END AS PackageId_Channel3 ,CASE chan4.Enabled WHEN 1 THEN ISNULL(chan4.Name, 'Disable Channel') ELSE 'Disable Channel' END AS PackageName_Channel4 ,CASE chan4.Enabled WHEN 1 THEN chan4.Id ELSE 0 END AS PackageId_Channel4 ,CASE chan5.Enabled WHEN 1 THEN ISNULL(chan5.Name, 'Disable Channel') ELSE 'Disable Channel' END AS PackageName_Channel5 ,CASE chan5.Enabled WHEN 1 THEN chan5.Id ELSE 0 END AS PackageId_Channel5 ,CASE chan6.Enabled WHEN 1 THEN ISNULL(chan6.Name, 'Disable Channel') ELSE 'Disable Channel' END AS PackageName_Channel6 ,CASE chan6.Enabled WHEN 1 THEN chan6.Id ELSE 0 END AS PackageId_Channel6 ,CASE chan7.Enabled WHEN 1 THEN ISNULL(chan7.Name, 'Disable Channel') ELSE 'Disable Channel' END AS PackageName_Channel7 ,CASE chan7.Enabled WHEN 1 THEN chan7.Id ELSE 0 END AS PackageId_Channel7 ,CASE chan8.Enabled WHEN 1 THEN ISNULL(chan8.Name, 'Disable Channel') ELSE 'Disable Channel' END AS PackageName_Channel8 ,CASE chan8.Enabled WHEN 1 THEN chan8.Id ELSE 0 END AS PackageId_Channel8 FROM Job j JOIN JobPackageSchedule pro ON pro.JobId = j.Id JOIN Package chan1 ON pro.PackageId_Channel1 = chan1.Id LEFT JOIN Package chan2 ON pro.PackageId_Channel2 = chan2.Id LEFT JOIN Package chan3 ON pro.PackageId_Channel3 = chan3.Id LEFT JOIN Package chan4 ON pro.PackageId_Channel4 = chan4.Id LEFT JOIN Package chan5 ON pro.PackageId_Channel5 = chan5.Id LEFT JOIN Package chan6 ON pro.PackageId_Channel6 = chan6.Id LEFT JOIN Package chan7 ON pro.PackageId_Channel7 = chan7.Id LEFT JOIN Package chan8 ON pro.PackageId_Channel8 = chan8.Id WHERE j.Name = ? AND chan1.Enabled = 1 AND j.Enabled = 1 ORDER BY pro.ExecutionOrder ASC
The engine room of the package is the FOR_IterateThroughChildPackages container, this can execute up to eight packages in parallel, by default a package will ( and should ) always run in Channel 1. If the SQL_GetJobPackages task finds NULLs for any of the packages in the Name column of the JobPackageSchedule table, it will set the package name to “Disable Channel” causing the execute package tasks for channels 2 through to 8 to be disabled:
Despite the fact the SSIS catalog provides out of the box logging for packages that are deployed to it, it does not cater for the recording of affected row counts, for this reason I use my own logging tables. Custom SSIS logging solutions can be integrated with the SSIS catalogue logging facility via the ServerExecutionId system variable, SSIS Mvp Rafael Salas has written this blog article on how to do this.
The framework uses a standard child package template. Event handlers log the start and end of package execution as well as errors ( should they occur ). To make the framework as flexible and as loosely coupled as possible, the event handlers use framework packages to log package activity to the logging database. Child packages should be as highly cohesive ( single-minded ) as possible, or more simply put contain as few tasks as possible. In order to tie child package execution to that of the entire ETL process, the framework passes the following parameters into each package:
Below is the database diagram for the Utility ( logging ) database:
When the framework starts up one of the first things it does is to obtain an execution id, it does this by inserting a single row into the JobExecution table, the execute id is returned via an OUTPUT clause. Job execution id creation takes place in the SQL_GetExecutionID task and this identifier associates all the package and DML execution with a specific job execution. When developing ETL jobs using the framework a number of conventions should be observed:
- All ETL job child packages are to be based on the standard template package provided with the framework.
- To prevent events fired by the event handlers belonging to containers inside the package do not get caught by the event handlers associated with the child package template, all event handlers inside child packages are to be disabled.
- For the framework to log information data lineage information, specifically what number of rows have been updated / deleted / inserted and in which fully qualified table ( [database_name].[schema_name].[table_name]; the package variables in the screen shot below are to be populated as appropriate:
To complete the subject of templates, below is the stored procedure template that to be used when implementing the delivery stage of the ETL Extract Cleanse Conform and Deliver cycle:
CREATE PROCEDURE [dbo].[uspDeliveryProcedureTemplate] @PackageId int ,@ExecutionId int ,@Channel int AS BEGIN SET NOCOUNT ON; SET XACT_ABORT ON; DECLARE @DatabaseName VARCHAR(40) = DB_NAME() ,@SchemaName VARCHAR(40) = 'MySchemaName' ,@TableName VARCHAR(128) ,@SourceName VARCHAR(40) = 'uspMyProcName' ,@SourceType CHAR = 'P' ,@SqlStatement CHAR = NULL ,@RowsInserted INT = 0 ,@RowsUpdated INT = 0 ,@RowsDeleted INT = 0 ,@RowsRejected INT = 0; BEGIN TRANSACTION BEGIN TRY EXECUTE [Utility].[Logging].[usp_LogDmlExecutionStart] @PackageId ,@ExecutionId ,@Channel ,@SourceType ,@SourceName ,@DatabaseName ,@SchemaName ,@TableName ,@SqlStatement; /* * INSERT, UPDATE or MERGE statement goes here */ EXECUTE [Utility].[Logging].[usp_LogDmlExecutionCompletion] @PackageId ,@ExecutionId ,@Channel ,@SourceType ,@SourceName ,@DatabaseName ,@SchemaName ,@TableName ,@SqlStatement ,@RowsInserted ,@RowsUpdated ,@RowsDeleted ,@RowsRejected; END TRY BEGIN CATCH DECLARE @StartTime DATETIME = GETDATE() ,@ErrorMessage NVARCHAR(255); SELECT @ErrorMessage = ERROR_MESSAGE(); IF @@TRANCOUNT > 0 BEGIN ROLLBACK; END; BEGIN TRANSACTION EXECUTE [Utility].[Logging].[usp_LogDmlExecutionError] @PackageId ,@ExecutionId ,@Channel ,@SourceType ,@SourceName ,@DatabaseName ,@SchemaName ,@TableName ,@StartTime ,@ErrorMessage COMMIT; END CATCH; IF @@TRANCOUNT > 0 BEGIN COMMIT; END; END;
To wrap up this post, I should mention configuration. I’m a big fan of using build configurations, for things such as connection strings my preference is to create project level parameters, associate these with a build configuration. When a child package requires a connection, the project level parameter should be associated with the connection via an expression. The advantage in doing this is that if child packages use common connections, the configuration associated with such connections can be:
- stored in one place
- associated with the SSIS project
- managed under source control using the source control system of choice bound to the project
In part two of this posting I will provide access to the framework packages and their associated database along with instruction on how to deploy and use the framework.
8 thoughts on “An Introduction To SSIS ETL Frameworks Part 1”
just wanted to let you know that the link to Remus Rusanu’s blog is broken (the name is also misspelled).
Great article! enjoyed reading it, was very informative. Did you ever write the follow up article or publish your framework? Reason being I looked at a couple of the available frameworks and they seem massive compared to our requirements. I would rather start simple and add functionality as and when required, as per your approach. Would like to see your 2nd post if you have it, before I make my decision, thanks.
I have the second blog post ready to go and also an updated prototype framework. With the caveat that people are welcome to use it just as long as they accept that it is a prototype and that I may fix bugs as and when I get the time I will make this available via git hub.
I will publish the post and the framework by the end of the week.
I’ve just published a blog post outlining the latest incarnation of the framework, it includes links to github repositories containing the relevant code artefacts, you are most welcome to take a look at this providing the caveat at the end of the post is noted.
Thanks Chris, will definitely take a look and your end point is noted. Also will let you know if I end up using your framework.
Chris, accolades for your ability to present a cohesive introduction. I feel knowledgeable enough now to articulate at a basic level what is being done in our IT shop.