A complete guide to making your data pipelines testable, maintainable and reliable
Why is it important to test your data pipelines?
Embedding appropriate tests in your data pipelines makes them less bug-prone and also ensures the data goes through proper data quality checks before flowing to the end data consumers.
The two key components of any data pipeline are “code” and “data”. Code is the tool used to manage how to Extract, Transform and Load (ETL) the data, while data is the ingredient of the data pipeline. To be honest, most data pipeline complexity lives in the data, not the code. In order to build and operate a reliable data pipeline, doing only standard code testing is not enough. Hence, when we talk about testing data pipelines, we need to make sure both code and data are tested properly.
Therefore, today’s article will be divided into 2 parts:
- Code testing — Similar to the tests on traditional software / applications. The code tests include unit tests, integration tests and end-to-end system tests. Code testing is generally performed as part of Continuous Integration (CI) pipelines to make sure the code quality is decent and the functions for data ingestion, data transformation and data loading behave as expected.
- Data testing — Set expectations on critical data elements and make sure the data passes these expectation checks before serving the end data consumers. Data testing requires continuous effort, and becomes even more critical once the data pipelines are deployed to production environments. Besides testing the data quality, it is also extremely important to monitor the testing results to make sure any data quality violations can be fixed immediately.
I will first talk about code testing and then data testing. At the end, I will share some open-source tools and frameworks that you can leverage to start adding critical tests to your data pipelines.
Code Testing
Code testing for data pipelines is not much different from code testing for software. It generally consists of unit testing, integration testing as well as end-to-end system testing. However, there are a couple of characteristics of data pipelines that make code testing harder:
- Firstly, data pipelines are extremely data dependent, so you will need to generate sample data — sometimes quite large volumes of sample data — for testing purposes.
- Secondly, executing data pipelines involves heavy dependencies on external systems, including processing systems like Spark and Databricks, and data warehouses such as Snowflake, Redshift and Databricks SQL. Therefore you need to find ways to test independently — separating tests of the data processing logic from tests of the connections and interactions with these external systems.
Let’s talk about unit testing and integration testing separately and understand how each works specifically for data pipelines.
- Unit testing — Unit tests are very low level and close to the source code of an application. They consist of testing individual methods and functions used in your data pipelines. The goal of unit tests for data pipelines is to catch errors without provisioning a heavy external environment. These errors could include refactoring mistakes, syntax errors in interpreted languages, configuration errors, graph structure errors, and so on. For example, you may need quite a few data transformation functions in order to derive the final results. You can unit-test these transformation functions to make sure they generate data as expected. As we discussed at the beginning of this article, data is the ingredient of any data pipeline. Therefore, to unit-test the functions used “in” the data pipeline, you also need to make sure there is data available for the test. There are two common ways to obtain sufficient data for testing your pipeline code. The first is to simulate fake testing data based on the distribution and statistical characteristics of the real data. The second is to copy a sample of real data into a development or staging environment for testing purposes. For the second, it is extremely important to make sure there is no data privacy, security or compliance violation when copying data to an environment that is less stringent than production. (A minimal pytest sketch follows this list.)
- Integration testing — Integration tests verify that the different modules or services used by your data pipelines work well together. The most critical integration tests for data pipelines are the ones covering interactions with data platforms, such as data warehouses, data lakes (mainly cloud storage locations) and data source applications (OLTP database applications and SaaS applications such as Salesforce and Workday). As we all know, the three key steps of a data pipeline are Extract, Transform and Load. At least two of them (extract and load), and sometimes all three, need to interact with the above-mentioned data platforms. Additionally, your data pipelines may also need to interact with a messaging system, such as Slack or Teams, to send notifications or alerts for key events in your data pipelines. Therefore it is important to run integration tests against all the external platforms and systems that your pipelines heavily interact with. (A sketch of such a test appears after the next paragraph.)
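Here is a minimal sketch of such a unit test, assuming a hypothetical transformation function and hand-crafted fake input data (all names below are invented for illustration):

```python
import pandas as pd
import pytest


def drop_invalid_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transformation: keep only orders with a positive amount."""
    return df[df["amount"] > 0].reset_index(drop=True)


def test_drop_invalid_orders_removes_non_positive_rows():
    # Simulated fake data standing in for real production records
    raw = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, -5.0, 0.0]})
    result = drop_invalid_orders(raw)
    assert list(result["order_id"]) == [1]
    assert (result["amount"] > 0).all()


def test_drop_invalid_orders_requires_amount_column():
    # The function should fail loudly on an unexpected schema
    with pytest.raises(KeyError):
        drop_invalid_orders(pd.DataFrame({"order_id": [1]}))
```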
Most data pipelines are written in Python. To automate the code testing of your data pipelines, you will need to leverage a Python testing framework, such as pytest, which is probably the most widely used Python testing framework. It can be used to write various types of software tests, including unit tests, integration tests, end-to-end tests, and functional tests.
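For integration-style tests, one common approach is to run the load logic against a lightweight local stand-in before pointing it at the real warehouse. The sketch below uses SQLite in place of a warehouse connection; the function and table names are invented for illustration:

```python
import sqlite3

import pandas as pd
import pytest


def load_to_warehouse(df: pd.DataFrame, conn, table: str) -> None:
    """Hypothetical load step: append a DataFrame into a warehouse table."""
    df.to_sql(table, conn, if_exists="append", index=False)


@pytest.fixture
def warehouse_conn(tmp_path):
    # A SQLite file acts as a stand-in for the real warehouse connection;
    # a staging Snowflake/Redshift connection would be used in a fuller setup.
    conn = sqlite3.connect(str(tmp_path / "warehouse.db"))
    yield conn
    conn.close()


def test_load_then_read_back(warehouse_conn):
    df = pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 20.0]})
    load_to_warehouse(df, warehouse_conn, "orders")
    read_back = pd.read_sql("SELECT * FROM orders", warehouse_conn)
    assert len(read_back) == 2
```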
Besides a testing framework, a data pipeline orchestration tool can also be leveraged to make code testing easier for your data pipelines. As far as I know, Dagster — a pipeline orchestrator — provides a data pipeline workflow and associated functions that let you unit-test your data applications, separate business logic from environments, and set explicit expectations on uncontrolled inputs. If you know of other high-quality orchestration tools that also make code testing easier, please feel free to let me know. I am always keen to learn more.
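As a rough sketch of what this looks like with Dagster’s in-process materialization (the asset name and logic here are invented for illustration):

```python
import pandas as pd
from dagster import asset, materialize


@asset
def cleaned_orders() -> pd.DataFrame:
    # Business logic kept free of environment concerns, so it is easy to test
    raw = pd.DataFrame({"order_id": [1, 2], "amount": [10.0, -3.0]})
    return raw[raw["amount"] > 0]


def test_cleaned_orders():
    # Materialize the asset in-process, without any deployed infrastructure
    result = materialize([cleaned_orders])
    assert result.success
    df = result.output_for_node("cleaned_orders")
    assert (df["amount"] > 0).all()
```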
Now that we have covered the code testing part, let’s move on to the data testing part, which is more interesting, dynamic and perhaps more challenging.
Let’s get started!
Data testing
As explained above, data testing is about setting expectations on critical data elements and making sure the data flowing through your data pipelines passes these expectation checks before serving the end data consumers. If there are any violations of these data quality expectations, the relevant notifications / alerts should be sent and corresponding fix actions taken.
Different from code testing, which is generally performed at compilation / deployment time, data testing is performed continuously, every time a new stream / batch of data is ingested and processed. You can think of data testing as a continuous series of acceptance tests. You make assumptions and set expectations about newly arrived data and test in real time to make sure those assumptions and expectations are met.
The reasons that data testing is absolutely critical are as follows:
- Firstly, data is the basis of many important business decisions. As more and more organizations move towards data-driven decision-making, data plays an increasingly important role in the operations of modern organizations. Therefore having good-quality data will also improve the quality and relevance of business decisions.
- Secondly, data always changes. Different from code, which is generally static and clean, data is extremely dynamic. There are many factors that may bring changes to your data, such as changes in business operations, macro-economic shifts, and the recent pandemic (Covid-19). All of these bring significant changes to the underlying data. Additionally, in most scenarios, data is dirty and requires some cleaning work before it is usable. Doing data testing ensures significant changes / drift can be detected in time, and corrupted data can be filtered out and rejected appropriately.
Data testing can be roughly divided into the following categories:
- Table-level tests: Table-level tests focus on understanding the overall shape of a table. Below are some sample table-level tests (a hedged code sketch covering both categories follows the column-level list):
#row-wise
expect_table_row_count_to_equal
expect_table_row_count_to_equal_other_table
expect_table_row_count_to_be_between
#column-wise
expect_table_column_count_to_equal
expect_table_column_count_to_be_between
expect_table_columns_to_match_ordered_list
expect_table_columns_to_match_set
- Column-level tests: Generally speaking, there are two types of column-level tests — single-column tests and multi-column tests.
Single-column tests tend to focus on setting expectations on the statistical properties of individual columns. For numerical columns, single-column tests check the column max, column min, column average, column distribution, and so on. For categorical columns, single-column tests check the column’s most common values, distinct values, and so on. Multi-column tests are more about checking the relationships between columns. For example, a multi-column test may expect the values in column A to be greater than those in column B. Below are some column-level tests:
#Single column tests
expect_column_average_to_be_within_range_of_given_point
expect_column_max_to_be_between
expect_column_min_to_be_between
expect_column_mean_to_be_between
expect_column_median_to_be_between
expect_column_distinct_values_to_be_in_set
expect_column_most_common_value_to_be_in_set
#Multi-column tests
expect_column_pair_values_to_be_equal
expect_column_pair_values_a_to_be_greater_than_b
expect_column_pair_values_to_have_diff
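To make these categories concrete, here is a minimal, hedged sketch of what such checks can look like with Great Expectations’ pandas-flavored API (introduced further below). The table schema and thresholds are invented for illustration, and exact method names and result objects vary across GX versions:

```python
import great_expectations as ge
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "order_amount": [120.0, 75.5, 60.0],
    "refund_amount": [0.0, 75.5, 10.0],
})

# Wrap the DataFrame so the expectation methods become available
gdf = ge.from_pandas(orders)

# Table-level checks: the overall shape of the table
assert gdf.expect_table_row_count_to_be_between(min_value=1, max_value=1_000_000).success
assert gdf.expect_table_columns_to_match_ordered_list(
    column_list=["order_id", "order_amount", "refund_amount"]
).success

# Single-column check: a statistical property of one column
assert gdf.expect_column_mean_to_be_between(
    "order_amount", min_value=0, max_value=10_000
).success

# Multi-column check: a refund should never exceed the original order amount
# (some releases use all-lowercase a/b in this method name)
assert gdf.expect_column_pair_values_A_to_be_greater_than_B(
    column_A="order_amount", column_B="refund_amount", or_equal=True
).success
```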
Of course, you can always customize the exact data tests based on your business domain knowledge, to make sure all the tests are relevant to your data and the business context your data operates in.
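As a trivial illustration, such a domain rule can start life as a plain Python check before being promoted to a reusable expectation (the rule and column name below are made up):

```python
import pandas as pd


def discount_within_policy(df: pd.DataFrame) -> bool:
    """Hypothetical business rule: discounts are capped at 30%."""
    return bool(df["discount_rate"].between(0.0, 0.3).all())


orders = pd.DataFrame({"discount_rate": [0.0, 0.15, 0.3]})
assert discount_within_policy(orders)
```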
There are a few open-source libraries designed to support data teams in implementing data quality tests, such as Great Expectations (GX), Deequ and PyDeequ.
I have already published another article that specifically talks about how to leverage Great Expectations (GX), Deequ and PyDeequ to embed reliability into your data pipelines. You can find the article here:
Summary
To end this article, it is important to reiterate that the goal of testing your data pipelines is to build highly reliable and trustworthy data pipelines, so that you can deliver high-quality data and information to downstream data consumers. Meanwhile, you should also avoid over-testing, meaning you should only test the code, features and data that are relevant to your data pipeline quality and the business use cases your data pipelines serve. Therefore, before you write tests, you need to understand which tests are critical, and it is also recommended to talk to your business stakeholders to get their domain knowledge on which data tests and expectations are most relevant and important to them.
Please feel free to follow me on Medium if you want to be notified when these articles are published. I generally publish 1 or 2 articles on data and AI every week.
If you want to see more guides, deep dives, and insights around the modern and efficient data+AI stack, please subscribe to my free newsletter — Efficient Data+AI Stack. Thanks!
Note: Just in case you haven’t become a Medium member yet and you’d like to get unlimited access to Medium, you can sign up using my referral link!
Thank you so much for your support!