An Introduction To SSIS ETL Frameworks Part 1

As a freelance SQL specialist I tend to see the following recurring SSIS problems:

  • Sprawling monolithic packages that implement Rube Goldberg / Heath Robinson style solutions.
  • No consistent approaches when implementing packages.
  • Packages that provide no instrumentation as to what is happening in the greater ETL process, in the worse cases this often results in developers having to fire up BIDS / SSDT in order to open up a package and run it in debug mode, in the worst case I’ve seen this happen on production environments.
  • Package that are not conducive to reuse.

The solution to this is something called an “ETL framework”, in this post I will outline my SSIS ETL framework. There are a variety of ways in which an ETL framework can be obtained or developed:

  • The purchasing of a commercial framework, such as BI Express from Pragmatic works.
  • A free framework such as the BI Monkey framework available on codeplex.
  • The “Roll your own” approach.
  • A hybrid approach, this is to obtain a free framework and then with the permission of its originator extend it to meet your specific requirements.

Regardless of how you obtain a framework, they all tend to have several things in common:

  • A metadata driven ‘Orchestration’ package which is essentially the run time persona of the framework. This is one “One package to rule them all” under which all custom ETL packages should run.
  • Metadata to define what packages make up an ETL job, what packages should run in parallel, package execution precedence.
  • A database to store the metadata.
  • A logging database to record all job activity, it’s a common practise to use this as a foundation for writing reports and dash boards on top of.
  • A child package template which all ETL job packages are based on.

What I will outline covers a basic framework that provides a job / package logging infrastructure and metadata based package configuration and orchestration. More advanced frameworks will leverage the use of metadata driven configuration to a much greater extent. There are people in the greater SQL community such as Andy Leonard and Matt Masson who are proponents of  a “Pattern based” approach to using SSIS, my recommendation encompasses this mind-set. If starting from scratch, follow a high level road map for implementing a framework.

The one thing I would advise against is the development of an all singing and all dancing framework from the outset, complexity should only be introduced into the framework on an as and when required basis. The approach I do recommend is:

  1. Start off with a basic framework which incorporates metadata data driven child package orchestration, common logging and a basic child package template.
  2. As a pattern a pattern emerges as to what the common ETL activities are, develop a standard set of child package template that encapsulate these, such activities might typically include: incremental loads, the staging of files, the archival of staged files, maintenance of dimension tables etc.
  3. Seek opportunities to take the leg work out of implementing new ETL processes by making the framework increasingly metadata driven.

A good framework should abstract all the plumbing style activity away from the ETL developer and be opaque to the child packages that run within it. SQL Server 2012 furnishes a number of framework like capabilities:

  • A standard way of dealing with package configuration via parameter-isation and build configurations.
  • A catalogue which records package execution statistics and comes with a number of canned reports.
  • A project based deployment model which integrates with the visual studio project model much better than the disjointed way of managing and deploying packages prior to SQL Server 2012.

Given the opportunity to do so, I would always recommend the use of these features in SQL Server 2012+.

Lets have a look at my framework, to begin with this is a database diagram for my metadata management database:

Metadata

The most basic metadata schema that can support a framework consists of a job table and a package table. The JobPackageSchedule table allows packages to be used by more than one job, thus promoting package reuse. The PrecenedenceConstraint table facilitates some control over whether the next package in the schedule is executed pending other package(s) associated with the same execution id having:

  • Completed successfully
    or
  • Just completed irrespective of whether it failed or no
    and / or
  • Has loaded data, i.e. has found that the source(s) contained data to load

At the time of writing this, I have only included the ability for child packages to be configured to run if previous packages in the execution of the job have loaded data. In the interests of performance and throughput the framework allows up to eight packages to executed in parallel, this is why the JobPackageSchedule table has the columns PackageId_Channel1, PackageId_Channel2 and so forth. Additionally the framework enables simple metadata controlled parameterisation, by each ‘Channel’ having the potential for an Int or Varchar value to be assigned to it. Whilst being good for throughput, parallel package execution has an added complexity overhead when it comes to recovery from failures. For this reason I advocate that only packages associated with the staging of data are executed in parallel. Assuming that  no historical data is present in the staging area, when a job is re-run, the staging packages can truncate whatever has been left in the staging tables from previous executions, irrespective of whether they have been successful or not. To complete the circle, I recommend that the final delivery of data be performed within the same transaction, my preference being to do this within a stored procedure. This whole approach enables roll forward job recovery.

Below is the orchestration package, all packages that run under the control of the framework execute as ‘Child’ packages of this one package:

EtlFrameworkRuntime

The image above is intended to help understand the overall shape of the orchestration package. A look at how an agent job is configured to run packages via the framework will help explain how the orchestration package acts as the entry point for the running of all jobs:

JobConfig

Because these screen shots come from an environment in use from one of my consulting engagements, I have obscured any sensitive information. On the package tab a reference is made to the EtlFrameworkRuntime.dtsx package, and on the configuration tab we refer to the name of the job we wish to run ( again, sensitive information is obscured out ):

JobConfig_param

In this example, the agent job is configured to run the “JDE Import” job.

The orchestration package has two distinct parts which are better explained if we zoom into the SSDT design canvas view of the package control flow:

EtlFrameworkRuntime_zi

There first part in the top half of the view checks that the job:

  • Exists in the metadata
  • Is not already running

Finally an execution id is obtained for the purpose of ‘Tagging’ all activity relating to the current execution of the job. In the SQL_GetJobPackages container the list of packages to run, package run order and parallel execution requirements is gathered. The results returned by the SQL_GetJobPackages ( as per the query below ) task are stored as a result set in an ADO.NET object.

SELECT     chan1.Name                                      AS PackageName_Channel1
          ,chan1.Id                                        AS PackageId_Channel1
          ,CASE chan2.Enabled
               WHEN 1
                   THEN ISNULL(chan2.Name, 'Disable Channel')
                   ELSE 'Disable Channel'
           END	                                           AS PackageName_Channel2
          ,CASE chan2.Enabled
               WHEN 1
                   THEN chan2.Id
                   ELSE 0
           END	                                           AS PackageId_Channel2
          ,CASE chan3.Enabled
               WHEN 1
                   THEN ISNULL(chan3.Name, 'Disable Channel')
                   ELSE 'Disable Channel'
           END	                                           AS PackageName_Channel3
          ,CASE chan3.Enabled
               WHEN 1
                   THEN chan3.Id
                   ELSE 0
           END	                                           AS PackageId_Channel3
          ,CASE chan4.Enabled
               WHEN 1
                   THEN ISNULL(chan4.Name, 'Disable Channel')
                   ELSE 'Disable Channel'
           END	                                           AS PackageName_Channel4
          ,CASE chan4.Enabled
               WHEN 1
                   THEN chan4.Id
                   ELSE 0
           END	                                           AS PackageId_Channel4
          ,CASE chan5.Enabled
               WHEN 1
                   THEN ISNULL(chan5.Name, 'Disable Channel')
                   ELSE 'Disable Channel'
           END	                                           AS PackageName_Channel5
          ,CASE chan5.Enabled
               WHEN 1
                   THEN chan5.Id
                   ELSE 0
           END	                                           AS PackageId_Channel5
          ,CASE chan6.Enabled
               WHEN 1
                   THEN ISNULL(chan6.Name, 'Disable Channel')
                   ELSE 'Disable Channel'
           END	                                           AS PackageName_Channel6
          ,CASE chan6.Enabled
               WHEN 1
                   THEN chan6.Id
                   ELSE 0
           END	                                           AS PackageId_Channel6
          ,CASE chan7.Enabled
               WHEN 1
                   THEN ISNULL(chan7.Name, 'Disable Channel')
                   ELSE 'Disable Channel'
           END	                                           AS PackageName_Channel7
          ,CASE chan7.Enabled
               WHEN 1
                   THEN chan7.Id
                   ELSE 0
           END	                                           AS PackageId_Channel7
          ,CASE chan8.Enabled
               WHEN 1
                   THEN ISNULL(chan8.Name, 'Disable Channel')
                   ELSE 'Disable Channel'
           END	                                           AS PackageName_Channel8
          ,CASE chan8.Enabled
               WHEN 1
                   THEN chan8.Id
                   ELSE 0
           END	                                           AS PackageId_Channel8
FROM      Job j
JOIN      JobPackageSchedule pro
ON        pro.JobId = j.Id
JOIN      Package chan1
ON        pro.PackageId_Channel1 = chan1.Id
LEFT JOIN Package chan2
ON        pro.PackageId_Channel2 = chan2.Id
LEFT JOIN Package chan3
ON        pro.PackageId_Channel3 = chan3.Id
LEFT JOIN Package chan4
ON        pro.PackageId_Channel4 = chan4.Id
LEFT JOIN Package chan5
ON        pro.PackageId_Channel5 = chan5.Id
LEFT JOIN Package chan6
ON        pro.PackageId_Channel6 = chan6.Id
LEFT JOIN Package chan7
ON        pro.PackageId_Channel7 = chan7.Id
LEFT JOIN Package chan8
ON        pro.PackageId_Channel8 = chan8.Id
WHERE     j.Name        = ?
AND       chan1.Enabled = 1
AND       j.Enabled     = 1
ORDER BY  pro.ExecutionOrder ASC

The engine room of the package is the FOR_IterateThroughChildPackages container, this can execute up to eight packages in parallel, by default a package will ( and should ) always run in Channel 1. If the SQL_GetJobPackages task finds NULLs for any of the packages in the Name column of the JobPackageSchedule table, it will set the package name to “Disable Channel” causing the execute package tasks for channels 2 through to 8 to be disabled:

disable chan

Despite the fact the SSIS catalog provides out of the box logging for packages that are deployed to it, it does not cater for the recording of affected row counts, for this reason I use my own logging tables. Custom SSIS logging solutions can be integrated with the SSIS catalogue logging facility via the ServerExecutionId system variable, SSIS Mvp Rafael Salas has written this blog article on how to do this.

The framework uses a standard child package template. Event handlers log the start and end of package execution as well as errors ( should they occur ). To make the framework as flexible and as loosely coupled as possible, the event handlers use framework packages to log package activity to the logging database. Child packages should be as highly cohesive ( single-minded ) as possible, or more simply put contain as few tasks as possible. In order to tie child package execution to that of the entire ETL process, the framework passes the following parameters into each package:

Child Parameters

Below is the database diagram for the Utility ( logging ) database:

Utility

When the framework starts up one of the first things it does is to obtain an execution id, it does this by inserting a single row into the JobExecution table, the execute id is returned via an OUTPUT clause. Job execution id creation takes place in the SQL_GetExecutionID task and this identifier associates all the package and DML execution with a specific job execution. When developing ETL jobs using the framework a number of conventions should be observed:

  • All ETL job child packages are to be based on the standard template package provided with the framework.
  • To prevent events fired by the event handlers belonging to containers inside the package do not get caught by the event handlers associated with the child package template, all event handlers inside child packages are to be  disabled.
  • For the framework to log information data lineage information, specifically what number of rows have been updated / deleted / inserted and in which fully qualified table ( [database_name].[schema_name].[table_name]; the package variables in the screen shot below are to be populated as appropriate:

child package vars

To complete the subject of templates, below is the stored procedure template that to be used when implementing the delivery stage of the ETL Extract Cleanse Conform and Deliver cycle:

CREATE PROCEDURE [dbo].[uspDeliveryProcedureTemplate]
     @PackageId int
    ,@ExecutionId int
    ,@Channel int
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;

    DECLARE  @DatabaseName VARCHAR(40)   = DB_NAME()
            ,@SchemaName   VARCHAR(40)   = 'MySchemaName'
            ,@TableName    VARCHAR(128)
            ,@SourceName   VARCHAR(40)   = 'uspMyProcName'
            ,@SourceType   CHAR          = 'P'
            ,@SqlStatement CHAR          = NULL
            ,@RowsInserted INT           = 0
            ,@RowsUpdated  INT           = 0
            ,@RowsDeleted  INT           = 0
            ,@RowsRejected INT           = 0;

    BEGIN TRANSACTION
        BEGIN TRY
            EXECUTE [Utility].[Logging].[usp_LogDmlExecutionStart]                  @PackageId
                ,@ExecutionId
                ,@Channel
                ,@SourceType
                ,@SourceName
                ,@DatabaseName
                ,@SchemaName
                ,@TableName
                ,@SqlStatement;

            /*
             * INSERT, UPDATE or MERGE statement goes here
             */

            EXECUTE [Utility].[Logging].[usp_LogDmlExecutionCompletion]
                @PackageId
               ,@ExecutionId
               ,@Channel
               ,@SourceType
               ,@SourceName
               ,@DatabaseName
               ,@SchemaName
               ,@TableName
               ,@SqlStatement
               ,@RowsInserted
               ,@RowsUpdated
               ,@RowsDeleted
               ,@RowsRejected;

        END TRY
        BEGIN CATCH
            DECLARE  @StartTime    DATETIME = GETDATE()
                    ,@ErrorMessage NVARCHAR(255);

            SELECT   @ErrorMessage = ERROR_MESSAGE();

            IF @@TRANCOUNT > 0
            BEGIN
                ROLLBACK;
            END;

            BEGIN TRANSACTION
                EXECUTE [Utility].[Logging].[usp_LogDmlExecutionError]
                     @PackageId
                    ,@ExecutionId
                    ,@Channel
                    ,@SourceType
                    ,@SourceName
                    ,@DatabaseName
                    ,@SchemaName
                    ,@TableName
                    ,@StartTime
                    ,@ErrorMessage
            COMMIT;
        END CATCH;

    IF @@TRANCOUNT > 0
    BEGIN
        COMMIT;
    END;
END;

To wrap up this post, I should mention configuration. I’m a big fan of using build configurations, for things such as connection strings my preference is to create project level parameters, associate these with a build configuration. When a child package requires a connection, the project level parameter should be associated with the connection via an expression. The advantage in doing this is that if child packages use common connections, the configuration associated with such connections can be:

  • stored in one place
  • associated with the SSIS project
  • standardised
  • managed under source control using the source control system of choice bound to the project

In part two of this posting I will provide access to the framework packages and their associated database along with instruction on how to deploy and use the framework.

8 thoughts on “An Introduction To SSIS ETL Frameworks Part 1

  1. Hi Chris,

    just wanted to let you know that the link to Remus Rusanu’s blog is broken (the name is also misspelled).

  2. Hi Chris,
    Great article! enjoyed reading it, was very informative. Did you ever write the follow up article or publish your framework? Reason being I looked at a couple of the available frameworks and they seem massive compared to our requirements. I would rather start simple and add functionality as and when required, as per your approach. Would like to see your 2nd post if you have it, before I make my decision, thanks.

    • I have the second blog post ready to go and also an updated prototype framework. With the caveat that people are welcome to use it just as long as they accept that it is a prototype and that I may fix bugs as and when I get the time I will make this available via git hub.

    • I’ve just published a blog post outlining the latest incarnation of the framework, it includes links to github repositories containing the relevant code artefacts, you are most welcome to take a look at this providing the caveat at the end of the post is noted.

      • Thanks Chris, will definitely take a look and your end point is noted. Also will let you know if I end up using your framework.

  3. Chris, accolades for your ability to present a cohesive introduction. I feel knowledgeable enough now to articulate at a basic level what is being done in our IT shop.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s