Workflow for defining a set of validations on a table and interrogating for results.
The `Validate` class is used for defining a set of validation steps on a table and interrogating the table with the validation plan. This class is the main entry point for the data quality reporting workflow. The overall aim of this workflow is to generate comprehensive reporting information to assess the level of data quality for a target table.
We can supply as many validation steps as needed; a large number of them increases the validation coverage for a given table. The validation methods (e.g., `col_vals_gt()`, `col_vals_between()`, etc.) translate to discrete validation steps, where each step is sequentially numbered (useful when viewing the reporting data). This process of calling validation methods is known as developing a validation plan.

The validation methods, when called, are merely instructions up to the point the concluding `interrogate()` method is called. That kicks off the process of acting on the validation plan by querying the target table and getting reporting results for each step. Once the interrogation process is complete, we can say that the workflow has reporting information. We can then extract useful information from the reporting data to understand the quality of the table. For instance, the `get_tabular_report()` method returns a table with the results of the interrogation, and `get_sundered_data()` allows for splitting the table based on passing and failing rows.
data : FrameT | Any
The table to validate, which could be a DataFrame object or an Ibis table object. Read the Supported Input Table Types section for details on the supported table types.
tbl_name : str | None = None
An optional name to assign to the input table object. If no value is provided, a name will be generated based on whatever information is available. This table name will be displayed in the header area of the HTML report generated by the `get_tabular_report()` method.
label : str | None = None
An optional label for the validation plan. If no value is provided, a label will be generated based on the current system date and time. Markdown can be used here to make the label more visually appealing (it will appear in the header area of the HTML report).
thresholds : int | float | bool | tuple | dict | Thresholds | None = None
Set threshold failure levels so that all validation steps can report and react accordingly when exceeding the set levels. This is to be created using one of several valid input schemes: (1) a single integer/float denoting the absolute number or fraction of failing test units for the 'warn' level, (2) a tuple of 1-3 values, (3) a dictionary of 1-3 entries, or (4) a `Thresholds` object.
Returns: a `Validate` object with the table and validations to be performed.
The `data=` parameter can be given any of the following table types:

- Polars DataFrame (`"polars"`)
- Pandas DataFrame (`"pandas"`)
- DuckDB table (`"duckdb"`)*
- MySQL table (`"mysql"`)*
- PostgreSQL table (`"postgresql"`)*
- SQLite table (`"sqlite"`)*
- Parquet table (`"parquet"`)*

The table types marked with an asterisk need to be prepared as Ibis tables (with type of `ibis.expr.types.relations.Table`). Furthermore, the use of `Validate` with such tables requires the Ibis library v9.5.0 and above to be installed. If the input table is a Polars or Pandas DataFrame, the Ibis library is not required.
Let’s walk through a data quality analysis of an extremely small table. It’s actually called `small_table` and it’s accessible through the `load_dataset()` function.
```python
import pointblank as pb

# Load the small_table dataset
small_table = pb.load_dataset()

# Preview the table
pb.preview(small_table)
```
Polars · Rows: 13 · Columns: 8

|    | date_time (Datetime) | date (Date) | a (Int64) | b (String) | c (Int64) | d (Float64) | e (Boolean) | f (String) |
|----|----------------------|-------------|-----------|------------|-----------|-------------|-------------|------------|
| 1  | 2016-01-04 11:00:00  | 2016-01-04  | 2 | 1-bcd-345 | 3    | 3423.29 | True  | high |
| 2  | 2016-01-04 00:32:00  | 2016-01-04  | 3 | 5-egh-163 | 8    | 9999.99 | True  | low  |
| 3  | 2016-01-05 13:32:00  | 2016-01-05  | 6 | 8-kdg-938 | 3    | 2343.23 | True  | high |
| 4  | 2016-01-06 17:23:00  | 2016-01-06  | 2 | 5-jdo-903 | None | 3892.4  | False | mid  |
| 5  | 2016-01-09 12:36:00  | 2016-01-09  | 8 | 3-ldm-038 | 7    | 283.94  | True  | low  |
| …  | …                    | …           | … | …         | …    | …       | …     | …    |
| 9  | 2016-01-20 04:30:00  | 2016-01-20  | 3 | 5-bce-642 | 9    | 837.93  | False | high |
| 10 | 2016-01-20 04:30:00  | 2016-01-20  | 3 | 5-bce-642 | 9    | 837.93  | False | high |
| 11 | 2016-01-26 20:07:00  | 2016-01-26  | 4 | 2-dmx-010 | 7    | 833.98  | True  | low  |
| 12 | 2016-01-28 02:51:00  | 2016-01-28  | 2 | 7-dmx-010 | 8    | 108.34  | False | low  |
| 13 | 2016-01-30 11:23:00  | 2016-01-30  | 1 | 3-dka-303 | None | 2230.09 | True  | high |

(The preview shows the first and last five rows; rows 6-8 are omitted.)
We ought to think about what’s tolerable in terms of data quality, so let’s designate proportional failure thresholds for the warn, stop, and notify states. This can be done by using the `Thresholds` class.
Now, we use the `Validate` class and give it the `thresholds` object (which serves as a default for all validation steps but can be overridden). The static thresholds provided in `thresholds` will make the reporting a bit more useful. We also need to provide a target table, and we’ll use `small_table` for this.
Then, as with any `Validate` object, we can add steps to the validation plan by using as many validation methods as we want. To conclude the process (and actually query the data table), we use the `interrogate()` method.
The `validation` object can be printed as a reporting table.
The report can be further customized by using the `get_tabular_report()` method, which contains options for modifying the display of the table.
Furthermore, post-interrogation methods such as `get_step_report()`, `get_data_extracts()`, and `get_sundered_data()` allow you to generate additional reporting or extract useful data for downstream analysis from a `Validate` object.