Validate, profile, and document your data
I used to work for a retail analytics company where we provided analytical solutions to help retailers improve their businesses, such as inventory and allocation optimization, demand forecasting, and dynamic pricing.
A typical workflow starts with a daily feed from the client, which is the raw data used as input for our solutions. After a series of data cleaning, manipulation, analysis, and modeling steps, results are created and sent to the client.
One of the main challenges in such processes is validating the data coming from the client. If it contains unexpected or absurd values, the results will not be useful. In fact, they might do more harm than good.
If these issues are only detected at the result step, the impact grows. You’ll probably have to rerun the pipeline, which means extra cost and wasted time. An even worse scenario would be sending the results to the client, who then uses them in their operations.
Luckily, we have lots of tools to prevent such disasters from happening. Great Expectations is one of them. It’s a Python library for validating, documenting, and profiling your data to maintain quality and improve communication between teams.
Great Expectations lets you assert what you expect from the data, which helps catch data issues quickly and at an early step.
The main component of the library is the Expectation, which is a declarative statement that can be evaluated by a computer. Expectations are basically unit tests for your data.
Expectations are given intuitive names that clearly tell us what they’re about. Here is an example:
expect_column_values_to_be_between(
    column="price", min_value=1, max_value=10000
)
This Expectation checks whether the values in the column are between the specified minimum and maximum values.
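To make the idea concrete, here is a rough plain-Python sketch of the kind of check such an Expectation performs. It is not how the library is implemented: the helper name check_values_between, the keys of the returned dictionary, and the sample prices are all made up for illustration, and the bounds are treated as inclusive.

```python
def check_values_between(values, min_value, max_value):
    """Hypothetical helper: report values outside [min_value, max_value]."""
    unexpected = [v for v in values if not (min_value <= v <= max_value)]
    return {
        "success": not unexpected,
        "unexpected_count": len(unexpected),
        "unexpected_list": unexpected,
    }


# Made-up prices; -5 and 25000 fall outside the 1..10000 range.
prices = [12.5, 480.0, -5, 9999.99, 25000]
report = check_values_between(prices, min_value=1, max_value=10000)
print(report)
```

The point of the sketch is the shape of the answer: a pass/fail flag plus enough detail to see which values broke the rule.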
There are many Expectations defined in the core library. However, we are not limited to or dependent on only these.
The Great Expectations library has many more Expectations contributed by the community.
We can install it via pip as follows:
pip install great_expectations
Then, we can import it:
import great_expectations as ge
Let’s do some examples using a sales dataset I prepared with mock data. You can download it from the datasets repository on my GitHub page. It’s called “sales_data_with_stores”.
In order to use the Expectations, we need a Great Expectations dataset. We have two different ways to create it:
- From a Pandas DataFrame using the from_pandas function
- From a CSV file using the read_csv function of Great Expectations
import great_expectations as ge

df = ge.read_csv("datasets/sales_data_with_stores.csv")
type(df)
# output
great_expectations.dataset.pandas_dataset.PandasDataset

df.head()
Expectation 1
In order to catch an unexpected value in a column with distinct values, we can use the expect_column_distinct_values_to_be_in_set expectation. It checks if all the values in the column are in the given set.
Let’s apply it to the store column.
df.expect_column_distinct_values_to_be_in_set(
    "store",
    ["Violet", "Rose"]
)
# output
{
  "result": {
    "observed_value": [
      "Daisy",
      "Rose",
      "Violet"
    ],
    "element_count": 1000,
    "missing_count": null,
    "missing_percent": null
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {},
  "success": false
}
The expectation fails (i.e. success: false) because we have a value (Daisy) in the store column that is not in the given list.
In addition to indicating success or failure, the output of an Expectation contains some other pieces of information, such as the observed values, the number of values, and the missing values in the column.
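Since the output is structured, the extra fields are easy to put to work. As a small sketch, the snippet below takes a trimmed copy of the output for the store column, written with the library's key names (result, observed_value, success), and works out which observed stores are not in the allowed set. The variable names are my own.

```python
import json

# Trimmed copy of the store-column output, with the library's key names.
raw_output = """
{
  "result": {
    "observed_value": ["Daisy", "Rose", "Violet"],
    "element_count": 1000,
    "missing_count": null,
    "missing_percent": null
  },
  "success": false
}
"""

allowed_stores = {"Violet", "Rose"}
output = json.loads(raw_output)

# Distinct values that were observed but are not in the allowed set.
observed = set(output["result"]["observed_value"])
unexpected_stores = sorted(observed - allowed_stores)
print(output["success"], unexpected_stores)
```

A list like this is handy for an alert message, because it tells the reader exactly which value caused the failure.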
Expectation 2
We can check if the maximum value of a column falls within a specific range:
df.expect_column_max_to_be_between(
    "price",
    min_value=0.1,
    max_value=2000
)
# output
{
  "result": {
    "observed_value": 1500.05,
    "element_count": 1000,
    "missing_count": null,
    "missing_percent": null
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {},
  "success": true
}
The output is in a dictionary format, so we can easily pick out a specific part of it and use it in our pipelines.
max_check = df.expect_column_max_to_be_between(
    "price",
    min_value=0.1,
    max_value=2000
)
max_check["success"]
# output
True
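One way to use this flag in a pipeline is to stop early when a check fails. The sketch below is only an illustration: DataValidationError and require_success are hypothetical names, and plain dictionaries stand in for the objects returned by the library so that the snippet runs on its own.

```python
class DataValidationError(Exception):
    """Raised when a data check fails (hypothetical name)."""


def require_success(check_result, check_name):
    """Stop the pipeline early if a check did not pass."""
    if not check_result["success"]:
        details = check_result.get("result")
        raise DataValidationError(f"{check_name} failed: {details}")
    return check_result


# Stand-ins for what df.expect_column_max_to_be_between(...) returns.
max_check = {"success": True, "result": {"observed_value": 1500.05}}
failed_check = {"success": False, "result": {"observed_value": 25000.0}}

require_success(max_check, "max price check")  # passes silently

try:
    require_success(failed_check, "max price check")
except DataValidationError as err:
    print(err)
```

Raising an exception at this point means the cleaning and modeling steps never run on a bad feed, which is exactly the rerun we want to avoid.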
Expectation 3
Uniqueness of values is important for some features, such as an id column. We can check if all the values in a column are unique.
# for a single column
df.expect_column_values_to_be_unique("product_code")

# for multiple columns
df.expect_compound_columns_to_be_unique(
    column_list=["product_code", "product_group"]
)
The outputs of these Expectations are quite long, so I’m not showing them here, but they include valuable insights such as the number of unexpected values and a partial list of the unexpected values.
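If it helps to see what "unique" means for one column versus a combination of columns, here is a plain-Python sketch of the same idea. It does not use the library; the rows are made up and find_duplicates is a hypothetical helper.

```python
from collections import Counter

# Made-up rows: (product_code, product_group)
rows = [
    (1001, "PG1"),
    (1002, "PG1"),
    (1001, "PG2"),
    (1003, "PG2"),
]


def find_duplicates(keys):
    """Return the keys that occur more than once, in first-seen order."""
    counts = Counter(keys)
    return [key for key, count in counts.items() if count > 1]


# Single column: product_code 1001 appears twice.
duplicate_codes = find_duplicates(code for code, _ in rows)

# Compound key: every (product_code, product_group) pair is unique.
duplicate_pairs = find_duplicates(rows)

print(duplicate_codes, duplicate_pairs)
```

In this toy data the single-column check fails while the compound check passes, which is why it matters to pick the check that matches how your ids are really defined.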
Expectation 4
A simple yet useful expectation is to check if a particular column exists in the dataset.
df.expect_column_to_exist("price")
# output
{
  "result": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {},
  "success": true
}
This comes in handy when you want to make sure that the daily data feed contains all the necessary columns.
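For a daily feed you usually care about a whole list of columns rather than one. The sketch below shows that idea using only the header row of a CSV file and the standard library. The required column list, the sample feed, and the missing_columns helper are assumptions for illustration; with Great Expectations you would call expect_column_to_exist once per column instead.

```python
import csv
import io

# Hypothetical list of columns the pipeline needs.
REQUIRED_COLUMNS = ["store", "product_code", "price"]


def missing_columns(csv_file, required):
    """Return the required column names absent from the CSV header row."""
    header = next(csv.reader(csv_file), [])
    return [name for name in required if name not in header]


# Made-up feed whose header lacks the price column.
feed = io.StringIO("store,product_code,product_group\nViolet,1001,PG1\n")
missing = missing_columns(feed, REQUIRED_COLUMNS)
print(missing)
```

An empty list means the feed has everything the pipeline expects; anything else names the columns to ask the client about.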
Conclusion
We have covered only 4 examples, but there are currently 297 expectations in the library and this number is increasing.
One of the things I really like about these expectations is that the names are self-explanatory, so it’s quite easy to understand what they do.
You may argue that these expectations can be checked using pure Python code or some other packages. You are right, but there are some advantages of using the Great Expectations library:
- It is easy to implement
- It has a standard and highly intuitive syntax
- Some expectations are not very simple and require writing many lines of code if you prefer to do it on your own
- Last but not least, Great Expectations also creates data documentation and data quality reports from these Expectations.
Thanks for reading. Please let me know if you have any feedback.