This much delayed part 2 of my introduction to an SSIS ETL framework covers the latest iteration of my SSIS ETL framework which I will make freely available to anyone who wishes to use it.
Since I wrote the original blog post I have made some enhancements to my framework, namely:
- Application packages are now executed using SSIS catalog stored procedures, this means that application packages no longer have to reside in the same project as the framework itself.
- The level of SSIS catalog logging can now be controlled by metadata.
- The 32 or 64 bit run time can be specified for each application child package via metadata.
- A very simple form of metadata controlled precedence has been built into the framework, this means that you can control whether a package is executed based on whether a package that has already executed has inserted / updated or deleted data (rows).
A Recap On What The Framework Includes
- An ETL Framework project This embodies the framework run time and includes a template package
( TaskPackageTemplate ) on which all application packages should be based.
- The metadata database The EtlFrameworkMetadata contains all the necessary table for specifying ETL job child package composition, running order and package configuration.
- The logging database EtlFrameworkLogging
- You must use integration services with SQL 2012 onwards.
- Any application packages developed should use the project deployment model and be deployed to a separate project from the framework.
How To Use The Framework
- Deploy the framework to the SSIS catalog.
- Create your application project and deploy it to the same catalog as the framework.
- All packages in the application project should be based on TaskPackageTemplate.dtsx, when developing your packages I would implore you to observe these ‘Holy’ commands of SSIS package development:
#1 Packages should be as simple as possible
Each package should contain one data flow or one task at most. Do not boil the ocean in a single package, you would not write an entire program in a single C# class, so why would you write an entire ETL application in a single package ?. Writing monolithic modules in any language is the worse possible thing that can be done in order to undermine reusability.
#2 Avoid burying ETL logic in scripting tasks if at all possible
The whole point of integration services is to furnish an ETL toolbox such that you do not have to write swathes of code in a 3GL, yes there might be legitimate circumstances when you need to do this, but this should be exception rather than the rule.
#3 All packages should be instrumented
Integration services provides this out of the box for packages deployed to the catalog and the framework provides its own logging infrastructure on top of this. I cannot overstate this point as I hate seeing developers open packages on production servers for execution in debug mode in order to see what the packages are doing, and when I say I hate this, I hate this with an absolute passion.
#4 Use A ‘Pattern’ Based Approach To Development
This is a topic which Andy Leonard champions, essentially when you are developing anything (within reason) the chances are that someone else has already done this and that for standard (ish) problems there are solutions that have been widely used, proven to work and are documented. The generic ‘Term’ for such a solution is a ‘Pattern’. To illustrate this, here are two SSIS patterns:
The package template contains the following variables:
When developing your packages 99.9% of the time they will effect a table in a database in terms of insert / updating / or deleting rows, your package should set these variables appropriately, the logging element of the framework will then look after the rest.
Once you have written all of your packages, the next step is creating the metadata to ‘Wire’ these together, this is the metadata schema:
4. The Project table is the first metadata table which should be populated. To illustrate using the framework I have created a project called “Sample”, this is deployed to the “ETL Framework” folder:
INSERT INTO [dbo].[Project] ( [Name] ,[Folder]) VALUES ( 'Example' ,'ETL Framework')
5. Create a row in the Package table for each application child package, in this very simple example, I’m going to create three child packages, packages: PackageA and PackageB will be executed in parallel and the execution of the third: PackageC will be contingent on PackageB having updated rows in the database:
INSERT INTO [dbo].[Package] ( [Name] ,[Enabled] ,[ProjectId] ,[Use32BitRunTime] ,[CatalogLoggingLevel]) VALUES ( 'PackageA.dtsx' ,1 ,1 ,0 ,3 ) ,( 'PackageB.dtsx' ,1 ,1 ,0 ,3 ) ,( 'PackageC.dtsx' ,1 ,1 ,0 ,3 )
6. The next step is to create a ‘Job’ by inserting a row into the Job table. Assuming the contents of the Id column for the row inserted into the Project table from the last step is ‘1’, this is what the insert statement looks like for this:
INSERT INTO [dbo].[Job] ( [Name] ,[Enabled]) VALUES ( 'Example Job', ,1 )
7. One of the most powerful features of the framework is the ability it provides to ‘Wire’ packages together via metadata and its this step in which this is done. The JobPackageSchedule table provides the means of wiring packages to jobs and allows packages to be reused across jobs. This is one of the reasons why I stress that packages should be as simple as possible in order to foster reusability. The JobPackageSchedule table enables upto eight packages to be executed in parallel via ‘Channel’, the columns suffixed by ‘1’ are for packages executed on channel 1, the columns suffixed by ‘2’ are for packages executed on channel 2 and so forth.
INSERT INTO [dbo].[JobPackageSchedule] ( [JobId] ,[PackageId_Channel1] ,[PackageId_Channel2] ,[ExecutionOrder]) VALUES ( 1 ,1 ,2 ,1) INSERT INTO [dbo].[JobPackageSchedule] ( [JobId] ,[PackageId_Channel1] ,[ExecutionOrder]) VALUES ( 1 ,3 ,2)
8. The final thing I am going to do and this is completely optional is create a precedence constraint between one of the packages that was executed in parallel and the final package to be executed, this is done via the PrecendenceConstraint table. I’m going to create a row in this table such that package 3 will only execute if package 2 has updated one or more rows:
INSERT INTO [dbo].[PrecedenceContraint] ( [PackageId] ,[LoadedData] ,[UpdatedData] ,[DeletedData] ,[RejectedData] ,[DependantPackageId]) VALUES ( 3 ,0 ,1 ,0 ,0 ,2)
With the caveat that the initial release of the framework has not been exhaustively tested and as such it may contain bugs, which I accept no liability for nor make any commitment as to fixing them, readers of this post are welcome to download the framework from github. There are three github repositories:
- The SSIS project for the orchestration part of the framework
- A database project for the framework logging database
- A database project for the framework metadata database
Feedback on this is most welcome, enjoy !.