Get your ETL flow under control using 3-sigma control limits

THEORY PART

Hey everyone, thanks for joining this presentation, it’s great to see you here. Before we start, let me introduce myself and what I’ll be talking about today. My name is Pavel Prudky and I’ve been working with data, databases and reporting development for my whole professional life. At the beginning of my career I was strictly a database and reporting developer, and for the last 6 years I’ve focused mainly on Business Intelligence development and lately also on data and ETL engineering. I have worked for all kinds of companies, from enterprise-scale corporations to startups, and found out that startups work better for me. That’s enough about me, so let’s talk about what’s ahead of us in this presentation.

As stated on the BWS homepage, I would like to talk about getting your ETL job flow under control using 3-sigma control limits. And what does this mean? Imagine yourself in a situation where you are part of a data processing team: let’s say you have flat files, CSV or XML, coming in as the sources, and your job is to make sure the data goes flawlessly through the whole data pipeline to the target systems, most likely CRM applications, risk modules, reporting systems et cetera. Now, at every stage of my career, in every company I worked for, I always wanted to be on top of things and to know first what is happening, rather than having data-related incidents raised by my boss, the business, or some other folks from the client’s management. And when I say what is happening, I mean what is failing, where, and when.


Typically you load a file, the ETL logging mechanism inside your ETL framework captures some meta information like the filename, the row count, when the file was created etc., and a proper logging mechanism will store at least the row count, the start timestamp, the end timestamp and the status for every stage of the processing. With this information persisted inside a logging data mart, or maybe even just one large logging table, you are able to start applying statistical rules on top of it and get things under control. Today I would like to explain how to use 'Three-Sigma Limits' for this purpose.


A three-sigma limit (3-sigma limit) is a statistical calculation that refers to data within three standard deviations of a mean. In business applications, three sigma refers to processes that operate efficiently and produce items of the highest quality. The so-called three-sigma rule of thumb expresses a conventional heuristic that nearly all values lie within three standard deviations of the mean, and thus it is empirically useful to treat 99.7% probability as near certainty. Applied to our ETL logging, I will use this rule to determine whether an attribute tracked in the ETL log (the row count of a file extracted in a specific stage, or the processing duration of a file inside a specific stage) lies inside the 3-sigma control limits. If it does, the value is not an outlier, not something to consider a problem, and all’s good. (for now)
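In plain terms, for each tracked attribute the upper and lower control limits are just the historical mean plus or minus three standard deviations:

UCL = mean + 3 * standard deviation
LCL = mean - 3 * standard deviation

Any new value falling outside the [LCL, UCL] band is treated as an outlier worth a closer look.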


My aim in this presentation / demo is not to go too deep into statistical process control theory; there are plenty of resources out there on that topic. My main focus is to show that adding this layer to your ETL processing is not too complicated, setting up the components is not too demanding, and in general, in one hour, you can have this up and running and see for yourself whether this solution is beneficial for you.


So let’s dive in a little deeper. As mentioned earlier, my development goal is to see which row counts fall above or below the 3-sigma control limits in the data processing ETL framework. I would also like to know if any file’s processing duration in a specific stage is above or below the control limits. For this demo, I prepared a sample logger history table stored in SQL Server 2017. The table looks something like this:

logger_history schema
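For anyone following along without the live demo, a minimal sketch of such a table could look like the following (the column names are illustrative assumptions, not the exact demo schema):

CREATE TABLE dbo.logger_history (
    logger_history_id   BIGINT IDENTITY(1,1) PRIMARY KEY,
    file_name           NVARCHAR(260) NOT NULL,
    processing_server   NVARCHAR(128) NOT NULL,
    stage_name          NVARCHAR(128) NOT NULL,
    row_count           BIGINT        NULL,
    start_timestamp     DATETIME2(3)  NULL,
    end_timestamp       DATETIME2(3)  NULL,
    status              NVARCHAR(50)  NULL,
    Moved_To_ES         BIT NOT NULL DEFAULT 0,  -- already shipped to the Windows Event Log / Elasticsearch
    IsProcessed         BIT NOT NULL DEFAULT 0   -- already evaluated by the 3-sigma Stored Procedure
);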


/ Opening SSMS /

As you can see, these dummy attributes are persisted by my ETL logging mechanism inside the framework at the granularity of “File – Processing Server Name – Stage Name“. Let’s have a brief look at the values. I believe you’ll agree with me that this is nothing outstanding, just a standard normalized logging table. I would like to point out the Moved_To_ES and IsProcessed Boolean attributes. If you get interested in this project, you might also find it very useful to have a de-normalized check name – check value view on top of this table, as sketched below.
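A minimal sketch of such a view, assuming the illustrative columns from the schema sketch above, could unpivot the tracked metrics into name/value pairs with CROSS APPLY:

CREATE VIEW dbo.v_logger_checks
AS
SELECT  lh.file_name,
        lh.processing_server,
        lh.stage_name,
        c.check_name,
        c.check_value
FROM dbo.logger_history AS lh
CROSS APPLY (VALUES
        ('row_count',        CAST(lh.row_count AS FLOAT)),
        ('duration_seconds', CAST(DATEDIFF(SECOND, lh.start_timestamp, lh.end_timestamp) AS FLOAT))
) AS c (check_name, check_value);

This shape makes it easy to add new checks later without changing the consumers of the view.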
I will show you how I built a Stored Procedure operating on top of this logger_history table and how it’s used to determine the outliers.


/ Scripting out the code of the gen_3sigma_event SP /

The first block of code calculates the mean and the standard deviations based on the group-by columns I chose.
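As a hedged sketch (using the illustrative column names from earlier, not the exact demo code), that first block might boil down to something like this:

-- Derive per file / server / stage control limits from the historical runs
SELECT  file_name,
        processing_server,
        stage_name,
        AVG(CAST(row_count AS FLOAT))   AS mean_row_count,
        STDEV(CAST(row_count AS FLOAT)) AS stdev_row_count,
        AVG(CAST(row_count AS FLOAT)) - 3 * STDEV(CAST(row_count AS FLOAT)) AS lcl_row_count,
        AVG(CAST(row_count AS FLOAT)) + 3 * STDEV(CAST(row_count AS FLOAT)) AS ucl_row_count
INTO    #control_limits
FROM    dbo.logger_history
GROUP BY file_name, processing_server, stage_name;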


/going through the code/ 

The second block of code calculates the outliers and marks all the checked rows with a flag, so those values are not processed again on the next scheduled run of this SP.
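A possible shape of that second block, continuing the sketch above (the #outliers work table is my own illustration):

-- Capture the rows that fall outside the 3-sigma band into a work table
SELECT  lh.logger_history_id,
        lh.file_name,
        lh.processing_server,
        lh.stage_name,
        lh.row_count
INTO    #outliers
FROM    dbo.logger_history AS lh
JOIN    #control_limits    AS cl
  ON    cl.file_name         = lh.file_name
  AND   cl.processing_server = lh.processing_server
  AND   cl.stage_name        = lh.stage_name
WHERE   lh.IsProcessed = 0
  AND  (lh.row_count < cl.lcl_row_count OR lh.row_count > cl.ucl_row_count);

-- Mark every evaluated row so it is not picked up by the next scheduled run
UPDATE  dbo.logger_history
SET     IsProcessed = 1
WHERE   IsProcessed = 0;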


/going through the code/ 

The third block of code iterates through the outliers and dynamically generates the SQL statements that call the system SP xp_logevent, which passes those values to the Windows Event Log for us. I also mark the outliers with a flag so they do not get sent to the event log again next time. Also note how I iterate over the result set; this is a neat way of substituting SQL Server cursors in case you’re not using them.
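A sketch of that third block, under the same assumptions as above (event id 60000 and the message format are arbitrary illustrations):

DECLARE @id  BIGINT,
        @msg NVARCHAR(2048),
        @sql NVARCHAR(MAX);

SELECT @id = MIN(logger_history_id) FROM #outliers;

WHILE @id IS NOT NULL
BEGIN
    SELECT @msg = CONCAT('3-sigma outlier: file=', file_name,
                         ' server=', processing_server,
                         ' stage=', stage_name,
                         ' row_count=', row_count)
    FROM   #outliers
    WHERE  logger_history_id = @id;

    -- Dynamically generated call to the system SP that writes to the Windows Application log
    SET @sql = N'EXEC master.dbo.xp_logevent 60000, N''' + REPLACE(@msg, '''', '''''') + N''', ''WARNING'';';
    EXEC (@sql);

    -- Remember that this outlier has already been shipped, so it is not sent again
    UPDATE dbo.logger_history SET Moved_To_ES = 1 WHERE logger_history_id = @id;

    -- Advance to the next outlier without opening a cursor
    SELECT @id = MIN(logger_history_id) FROM #outliers WHERE logger_history_id > @id;
END;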


Now to the determined outlier values – I would like to pass them to some company-wide monitoring platform; for this demo I chose Elasticsearch, with the results visible in a Kibana dashboard. The way this is done (and yes, I’m aware there are other options) is:
SQL Server ETL logging database -> SQL Agent job scheduling a Stored Procedure that passes events to the Windows Event Log -> Winlogbeat -> Elasticsearch (Kibana)

 

flow diagram

It gets a little bit hacky, but what I like is that you can build this on your own from the ground up while being able to customize almost everything. I assume you’ve had enough of talking, and that makes sense, but before we get our hands dirty, I need to introduce the key components of this setup step by step.

Prerequisites to get us started:

DEMO PART

Now let’s get to the demo part. I have already installed all the prerequisites to save some time, so let’s have a look at what we have in place and review the code needed. I will log in to the local SQL Server instance in SSMS, load legacy dummy ETL logging data that I know will generate statistical-control outliers, walk through the data model again, and script out the Stored Procedure that captures the outliers to make sure everyone understands it. I will alter this SP to print out the dynamically generated statements so we can see what gets passed to the Windows Event Log. I will review the code to see how the mean and standard deviations are calculated on top of the data. I will go through xp_logevent and what I built around it, to make sure everyone understands how to pass the event to the Windows Event Log and how to make sure the data doesn’t get distorted by other Windows events. I will create a SQL Agent job running every 5 minutes that executes this Stored Procedure. We will then review the raised events inside the Windows Event Viewer.
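For reference, a minimal sketch of such a 5-minute SQL Agent job (the job, schedule and database names are illustrative; gen_3sigma_event is the demo SP from earlier):

USE msdb;
GO

EXEC dbo.sp_add_job         @job_name = N'gen_3sigma_event - every 5 min';
EXEC dbo.sp_add_jobstep     @job_name = N'gen_3sigma_event - every 5 min',
                            @step_name = N'Run outlier detection',
                            @subsystem = N'TSQL',
                            @database_name = N'ETL_Logging',   -- illustrative logging database name
                            @command = N'EXEC dbo.gen_3sigma_event;';
EXEC dbo.sp_add_schedule    @schedule_name = N'Every 5 minutes',
                            @freq_type = 4,                    -- daily
                            @freq_interval = 1,
                            @freq_subday_type = 4,             -- repeat in units of minutes
                            @freq_subday_interval = 5;
EXEC dbo.sp_attach_schedule @job_name = N'gen_3sigma_event - every 5 min',
                            @schedule_name = N'Every 5 minutes';
EXEC dbo.sp_add_jobserver   @job_name = N'gen_3sigma_event - every 5 min';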

The next step will be going through Winlogbeat’s YAML configuration file, making sure we capture only the events we are interested in and that we pass the data to the Elasticsearch instance installed on our local machine. We will start Winlogbeat as a Windows service from PowerShell, I will start the Elasticsearch service, and then we will proceed to Kibana, where we should be able to see the events being generated. We might need to set up the index pattern properly, and then we will continue to the last part: creating the timeline dashboard.

WRAPPING UP

As you can see from the dashboard, we are able to see our outlier-based events in Kibana. Now, I consider this more of a beginning, because this is the point from which you can start building your own solution. And I believe it’s important to look at this solution critically and weigh the pros and cons.

PROS:

CONS:

OTHER POSSIBLE SOLUTIONS:

MY TIPS:

Thanks for your attention, now is the time for Q&A.