Data Quality Checks in the ETL Process
Essential Quality Checks in ETL
The cost of finding software defects increases exponentially in the later stages of the development lifecycle. In data warehousing, this is compounded by the additional business cost of using incorrect data to make critical business decisions. Early detection of these defects should therefore be treated as a critical priority, together with a regular process of continuous testing through a CI pipeline. But how can this be implemented?
Maintaining quality in ETL projects is a complex process for many reasons, and it depends on the complexity of the system being developed and used. An ETL pipeline or data orchestration system typically handles huge amounts of data, and that data often comes from multiple sources which need to be combined, transformed and loaded into target schemas. During the initial setup of a client system in production, the data is loaded as a full load; the scheduled runs that follow on an ongoing basis (daily, nightly, weekly or monthly loads) are treated as incremental/delta loads.
ETL workflows involve all kinds of complex calculations and transformations on the data, driven by client needs. How often do these calculations and transformation methods stay the same, and do they follow similar patterns? Some are similar, but they often change as client needs change. If we can trace all the minor and major patterns and similarities occurring in the entire process of loading data from the extracts into the target schemas, then it is possible to build a test suite that validates these checks on a regular basis, so that defects can be found and reported and quality can be maintained as an integrated part of the system.
There are several advanced tools available for software testing, but testing ETLs is something completely different. Tools dedicated to ETL testing exist, but they are mostly paid tools, and a paid tool does not necessarily keep evolving. Open-source tools are on the market, but mature ones are still rare. In many IT organizations, ETL systems are mostly tested manually, which is labor-intensive and error-prone. Running the tests manually, without a dedicated tool to schedule them, also prevents these tests from being automated.
However, as with any IT project, recurrent testing is important to guarantee a high level of quality in the customer's data and the software systems that store it. The more we test, the more bugs are fixed before going live. That is crucial in Business Intelligence: when users cannot trust the data, acceptance of the BI solution will fail.
How to test ETL processes, their challenges, and solutions:
To adequately validate your ETL processes, several tests need to be executed before you can conclude that the ETL works as it is supposed to.
Data completeness:
Probably the most obvious test is to check whether all records have been loaded into the data warehouse. For some ETLs this can be as easy as comparing the record count of the source and the target table. Unfortunately, it is not always that easy: data can be transformed into a completely different structure following some data modelling technique, in order to maintain historical information, to allow complex but fast calculations, or to improve reporting performance by storing aggregated results. This is the case in data orchestration systems, where more sophisticated tests may be required.
Most data warehouses also have a layered architecture: often just staging and a Kimball-like data warehouse, sometimes with an extra landing zone per layer or an operational data store as an additional layer. Staging can be persistent, but when data volumes grow, as in big data systems where terabytes need to be processed, it usually is not. Big data volumes also make it very inefficient to load all the data from scratch, so most ETLs apply some kind of delta mechanism to load only new and changed records. It is crucial to test not only the results of the full load but also the delta mechanism: are all new records coming in without creating duplicates, and are all changes applied correctly while maintaining historical information? A sketch of such a delta check is given below.
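As an illustration, the following sketch checks whether an incremental load has created duplicate "current" versions or left business keys without a current version in a slowly changing dimension. The table dim_customer and the columns customer_bk and is_current are hypothetical names, not taken from any specific system.

```sql
-- Delta mechanism check on a hypothetical SCD2 dimension (dim_customer).
-- 1) Business keys with more than one current version: duplicates created by the delta load.
SELECT customer_bk, COUNT(*) AS current_versions
FROM dim_customer
WHERE is_current = 'Y'
GROUP BY customer_bk
HAVING COUNT(*) > 1;

-- 2) Business keys with no current version at all: history was closed but never re-opened.
SELECT customer_bk
FROM dim_customer
GROUP BY customer_bk
HAVING SUM(CASE WHEN is_current = 'Y' THEN 1 ELSE 0 END) = 0;
```

Both queries should return zero rows after every incremental run.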
Data completeness checks embedded in the code ensure that all expected data is loaded into the target table; a sketch of how a few of these checks can be scripted follows the list below.
1. Compare record counts between source and target to check for rejected records.
2. Check for truncated data, to make sure data is not truncated in the columns of the target table.
3. Check that unique values are loaded into the target; no duplicate records should exist.
4. Check boundary values (for example, only data later than 2018-12-31 should be loaded into the target).
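The sketch below shows how these checks might look as plain queries. The source table src_orders, the target table stg_orders and the columns order_id, customer_name and order_date are illustrative assumptions; adapt the names and the boundary date to your own model.

```sql
-- 1) Record count comparison: a non-zero difference points to rejected records.
SELECT
    (SELECT COUNT(*) FROM src_orders) AS source_count,
    (SELECT COUNT(*) FROM stg_orders) AS target_count,
    (SELECT COUNT(*) FROM src_orders) -
    (SELECT COUNT(*) FROM stg_orders) AS difference;

-- 2) Truncation check: target text values must not be shorter than the source values.
SELECT s.order_id
FROM src_orders s
JOIN stg_orders t ON t.order_id = s.order_id
WHERE LENGTH(t.customer_name) < LENGTH(s.customer_name);

-- 3) Duplicate check on the key that should be unique in the target.
SELECT order_id, COUNT(*) AS occurrences
FROM stg_orders
GROUP BY order_id
HAVING COUNT(*) > 1;

-- 4) Boundary value check: no rows on or before the agreed cut-off date may be loaded.
SELECT COUNT(*) AS out_of_range_rows
FROM stg_orders
WHERE order_date <= DATE '2018-12-31';
```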
Data cleanness:
Unnecessary columns should be removed during the pre-staging phase of an ETL pipeline. Examples of data cleanness validations are given below.
Remove unwanted characters/text:
Example 1: If a name column contains extra spaces, the spaces have to be trimmed before loading into the staging area, which is typically done with an expression transformation. Removal of unwanted symbols and quotation marks is also handled in this step.
Merging columns:
Example 2: Suppose the telephone number and the STD code arrive in different columns and the requirement says they should be in one column; then, again with an expression transformation, the two values are concatenated into one column. Proper checks are needed to make sure these results are as desired; a sketch covering both examples is given below.
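The queries below sketch how both examples could be validated after the load. The tables src_customer and stg_customer and the columns customer_id, customer_name, std_code, phone_number and phone_full are illustrative assumptions.

```sql
-- Example 1: no leading/trailing spaces or stray quotation marks should survive the trim.
SELECT customer_id, customer_name
FROM stg_customer
WHERE customer_name <> TRIM(customer_name)
   OR customer_name LIKE '%"%';

-- Example 2: the merged column must equal the STD code concatenated with the number.
SELECT t.customer_id
FROM stg_customer t
JOIN src_customer s ON s.customer_id = t.customer_id
WHERE t.phone_full <> s.std_code || s.phone_number;
```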
Data transformation rules:
During the transformation step of an ETL, column values are changed and new calculated columns are created. These transformations need to be validated, and they should all be documented in some way, because this is easily forgotten: good testing goes hand in hand with good documentation. A sketch of a few of these checks follows the list below.
1. Number check: if the source uses a numbering format such as xx_30 but the target should contain only 30, then the prefix (xx_) must not be loaded; this needs to be validated.
2. Date check: dates should follow the agreed date format, and it should be the same across all records, for example the standard format yyyy-mm-dd.
3. Precision check: the precision of a value should appear as expected in the target table. Example: the source contains 19.123456, but the target should show 19.123 or round it to 20. This can involve rounding or truncating values.
4. Null check: some columns should contain null based on the business requirement. Example: the Termination Date column should remain null as long as the record has "Active" status.
5. Flag check: some columns hold flag values, such as "T"/"true" or "F"/"false", "Y"/"yes" or "N"/"no". A thorough verification should make sure that no value other than these flag values is present.
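A few of these rules can be expressed as queries like the sketch below. The target table stg_employee and the columns hire_date (stored as text in yyyy-mm-dd format), salary, termination_date, employment_status and active_flag are illustrative assumptions; the regular-expression operator shown is PostgreSQL syntax.

```sql
-- 2) Date check: hire_date must match the agreed yyyy-mm-dd format (PostgreSQL regex match).
SELECT employee_id, hire_date
FROM stg_employee
WHERE hire_date !~ '^\d{4}-\d{2}-\d{2}$';

-- 3) Precision check: salary should carry at most three decimal places.
SELECT employee_id, salary
FROM stg_employee
WHERE salary <> ROUND(salary, 3);

-- 4) Null check: termination_date must stay null while the status is 'Active'.
SELECT employee_id
FROM stg_employee
WHERE employment_status = 'Active'
  AND termination_date IS NOT NULL;

-- 5) Flag check: only the agreed flag values may appear.
SELECT DISTINCT active_flag
FROM stg_employee
WHERE active_flag NOT IN ('Y', 'N');
```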
Business rules:
It is very important to make sure that the business rules that have been identified are enforced correctly: values in columns should be within a certain range, some columns cannot be empty, and so on. To avoid loading "dirty" data into the data warehouse, this is another important test that should be executed during ETL testing.
1. Individual data check: based on business logic, records that do not meet certain criteria should be filtered out. Example: only records with date_sid >= 2008 and GLAccount != 'CM001' should be loaded into the target table. This can be implemented with filter conditions or router transformations that route specific data into the target table.
2. A complete data check between source and target, comparing them by running a MINUS/EXCEPT query or an inner query with a NOT condition, depending on what suits the needs of a particular project. In some cases the project may also demand an entire block of SQL with all the table-level rules and business logic built in; such a query retrieves data from the source tables so it can be compared with the target table populated by the ETL. To avoid performance issues while running these queries: i) build, test and validate them on identified small data sets of the complete data, which avoids the performance degradation of running them in bulk; ii) use proper indexing on the source tables the query block reads from; iii) use an appropriate mechanism for comparing with the target blocks of data. A sketch of such a comparison is given below.
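The sketch below compares source and target with EXCEPT, applying the filter rule from the example above. The tables src_gl_transactions and tgt_fact_gl and their columns are illustrative assumptions, and EXCEPT/MINUS syntax varies per database.

```sql
-- Rows that satisfy the business rule but are missing from the target.
SELECT transaction_id, date_sid, gl_account, amount
FROM src_gl_transactions
WHERE date_sid >= 2008
  AND gl_account <> 'CM001'
EXCEPT
SELECT transaction_id, date_sid, gl_account, amount
FROM tgt_fact_gl;

-- Rows in the target that should have been filtered out.
SELECT transaction_id, date_sid, gl_account, amount
FROM tgt_fact_gl
EXCEPT
SELECT transaction_id, date_sid, gl_account, amount
FROM src_gl_transactions
WHERE date_sid >= 2008
  AND gl_account <> 'CM001';
```

To keep the comparison fast, both halves can first be restricted to a small, identified slice of the data (for example one load period).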
Data Integrity:
Many tables in the DWH, especially fact tables, contain a lot of foreign keys to related tables, yet referential integrity in a data warehouse is usually not enforced by foreign key constraints in order to keep ETL performance high. You should therefore regularly validate the integrity of these keys to prevent orphan records; a sketch of such a check follows the list below.
Referential integrity checks: make sure the relationships between parent and child tables are correct as defined in the logical models.
Cascade table checks: in a staging environment, records loaded from an OLTP database do not carry the cascade mechanism at the database level; as a result, cascades have to be handled entirely by the ETL process as separate jobs. Validating these cascade-level rules, by verifying that the parent and child tables are linked and integrated properly, is therefore very necessary.
Entity change logs: when record activities such as insert, update or delete happen in the OLTP layer, the information needs to be passed on and actioned across the cascaded tables. A check is needed to make sure the ETL pipelines include these steps.
Key exclusion/inclusion checks: some keys might be excluded from the OLTP systems at the application level; these may still be processed at the OLTP database level, but at the staging level they might not be handled. ETL processes should be designed to process them, so checks that validate this at an early stage can detect defects.
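Since foreign keys are typically not enforced in the warehouse, orphan records can be detected with a query like the one below. The tables fact_sales and dim_product and the column product_sid are illustrative assumptions.

```sql
-- Referential integrity check: fact rows pointing to a non-existent dimension row.
SELECT f.sales_id, f.product_sid
FROM fact_sales f
LEFT JOIN dim_product d
       ON d.product_sid = f.product_sid
WHERE d.product_sid IS NULL;   -- any row returned is an orphan record
```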
Standards and Conventions
Besides testing the correctness of the data, it is also recommended to plan some technical tests. Every data warehouse environment should respect some kind of coding standards, policies and naming conventions that have been agreed with the data warehouse architect. Running a regular test to check whether all developers respect these guidelines helps you maintain the stability, performance and scalability of the data warehouse environment.
Some examples of these technical tests are listed below; a metadata-based sketch of a few of them follows the list.
1. Dimension table checks: every dimension table has a unique, numeric primary key and correctly defined foreign keys.
2. Index checks: every table has a unique index containing the business key columns.
3. Partitioning checks: is the partitioning of tables/data done as expected?
4. ETL load audit checks: are all audit columns available in every table?
5. Audit log checks: make sure all ETL jobs generate valid log files in the system.
6. DDL check: a check highly recommended by data architects, to make sure the table definitions of all tables follow the physical data models defined for the system. These models can keep changing based on customer requests, so an automated mechanism to test this on a regular basis is expected.
7. DML/DCL/TCL checks: these are database-level control operations and are not always mandatory to check, but a test to make sure they are all in place and working as expected can be added.
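On databases that expose an information schema (for example PostgreSQL), some of these technical tests can be automated with metadata queries like the sketch below. The schema name dwh, the audit columns load_dts and record_source and the dim_ prefix for dimension tables are assumed conventions, not a standard.

```sql
-- 4) Audit column check: tables missing one of the agreed audit columns.
SELECT t.table_name
FROM information_schema.tables t
WHERE t.table_schema = 'dwh'
  AND t.table_type = 'BASE TABLE'
  AND (SELECT COUNT(*)
       FROM information_schema.columns c
       WHERE c.table_schema = t.table_schema
         AND c.table_name   = t.table_name
         AND c.column_name IN ('load_dts', 'record_source')) < 2;

-- 1) Dimension table check: dimension tables without a primary key constraint.
SELECT t.table_name
FROM information_schema.tables t
WHERE t.table_schema = 'dwh'
  AND t.table_name LIKE 'dim\_%' ESCAPE '\'
  AND NOT EXISTS (SELECT 1
                  FROM information_schema.table_constraints tc
                  WHERE tc.table_schema    = t.table_schema
                    AND tc.table_name      = t.table_name
                    AND tc.constraint_type = 'PRIMARY KEY');
```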
How can these tests and checks be automated and run continuously in an agile environment?
Paid ETL test solutions/tools
The easiest, but most expensive, option is to buy a dedicated tool for testing your ETLs. For example, if you use Informatica PowerCenter, you could buy Informatica Data Validation, which provides complete and effective data validation and ETL testing with no programming skills required. If your ETL solution vendor doesn't offer such a tool, you could look at a more generic testing solution. One of the most popular and advanced tools is probably QuerySurge. QuerySurge was built specifically to automate the testing of data and supports several data warehouse technologies and even Hadoop, NoSQL, XML files, etc. Microsoft offers Visual Studio Test Professional. This is also a great tool for testing programming code and does have support for database unit testing.
However, as it is not dedicated to ETL testing, you will have to build custom test providers, and it is also missing some nice features such as data quality dashboards, out-of-the-box access to multiple platforms, and support for testers with no programming skills. For those who seek paid solutions, this article is of little use. Still, I would recommend that even those using paid solutions consider an open-source tool, which has a wide community driving ongoing development based on data warehouse developer/architect suggestions and customer feedback.
Open source solutions:
There are some open-source tools as well, for example DbFit (http://dbfit.github.io/dbfit) and AnyDbTest (https://anydbtest.codeplex.com/), but they are often not very mature.
If you prefer not to purchase a dedicated testing tool, you can also create something yourself. It won't be perfect, but it is still a lot better than not automating your tests at all. When developers accurately test their ETLs, they already write a lot of SQL scripts. Automating these same tests means developing a solution that stores these scripts, schedules and executes them, and automates the comparison of the outcome of the test scripts with the expected results.
In your data warehouse, all you need is some kind of configuration table to store the test scripts, the return value of each script, and the expected result used to decide whether the test has succeeded or failed. As PostgreSQL can connect to almost any data source, you can easily build a package that executes all test scripts and compares results between multiple databases and environments. You can also use an SQL agent to schedule the tests and a separate email mechanism to publish the results. Of course, writing reusable scripts is a little harder than writing "quick and dirty" test scripts meant to be used only once, but enforcing standards in your BI solution already makes this a lot easier. And don't forget that these scripts can be scheduled to run again and again in the regression and integration phases of testing. A minimal sketch of such a configuration setup is given below.
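The DDL below is a minimal sketch of such a setup, assuming PostgreSQL; the table names, columns and sample test are all illustrative.

```sql
-- Configuration table: one row per reusable test script.
CREATE TABLE etl_test_config (
    test_id        serial  PRIMARY KEY,
    test_name      text    NOT NULL,
    test_sql       text    NOT NULL,   -- script that returns a single numeric value
    expected_value numeric NOT NULL,   -- result that means "test passed"
    is_active      boolean NOT NULL DEFAULT true
);

-- Results table, filled by the scheduled runner (SQL agent, cron job or CI step).
CREATE TABLE etl_test_result (
    result_id     serial    PRIMARY KEY,
    test_id       integer   NOT NULL REFERENCES etl_test_config (test_id),
    run_timestamp timestamp NOT NULL DEFAULT now(),
    actual_value  numeric,
    passed        boolean
);

-- Example entry: the duplicate check from the data completeness section.
INSERT INTO etl_test_config (test_name, test_sql, expected_value)
VALUES ('stg_orders has no duplicate order_id',
        'SELECT COUNT(*) FROM (SELECT order_id FROM stg_orders GROUP BY order_id HAVING COUNT(*) > 1) d',
        0);
```

The runner simply executes each active test_sql, writes the outcome to etl_test_result, and marks the test as passed when the actual value equals the expected value.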
One of the more complex parts of the ETL testing process is setting up the test beds, which includes both the environment and the test data. Test data identification plays a major role in testing ETL systems; sometimes you have to benchmark data from production and write scenarios to test it. Setting up test data that covers all the checks mentioned in the sections above, for all the tables and ETL pipelines in a data warehouse, is not an easy job and it takes time. But once it is set up, it feeds into continuous testing, and you can easily add it to an existing CI/CD pipeline by integrating with Jenkins, or create a new one.
To learn in more detail how to create an ETL test framework and integrate it with a CI engine, see the references provided below.
References: