import pointblank as pb
# Load the small_table dataset
= pb.load_dataset()
small_table
# Preview the table
pb.preview(small_table)
PolarsRows13Columns8 |
||||||||
Validate(
self,
data,
tbl_name=None,
label=None,
thresholds=None,
actions=None,
final_actions=None,
brief=None,
lang=None,
locale=None,
)
Workflow for defining a set of validations on a table and interrogating for results.
The Validate
class is used for defining a set of validation steps on a table and interrogating the table with the validation plan. This class is the main entry point for the data quality reporting workflow. The overall aim of this workflow is to generate comprehensive reporting information to assess the level of data quality for a target table.
We can supply as many validation steps as needed, and having a large number of them should increase the validation coverage for a given table. The validation methods (e.g., col_vals_gt()
, col_vals_between()
, etc.) translate to discrete validation steps, where each step will be sequentially numbered (useful when viewing the reporting data). This process of calling validation methods is known as developing a validation plan.
The validation methods, when called, are merely instructions up to the point the concluding interrogate()
method is called. That kicks off the process of acting on the validation plan by querying the target table getting reporting results for each step. Once the interrogation process is complete, we can say that the workflow now has reporting information. We can then extract useful information from the reporting data to understand the quality of the table. Printing the Validate
object (or using the get_tabular_report()
method) will return a table with the results of the interrogation and get_sundered_data()
allows for the splitting of the table based on passing and failing rows.
data : FrameT
| Any
The table to validate, which could be a DataFrame object or an Ibis table object. Read the Supported Input Table Types section for details on the supported table types.
tbl_name : str
| None = None
A optional name to assign to the input table object. If no value is provided, a name will be generated based on whatever information is available. This table name will be displayed in the header area of the tabular report.
label : str
| None = None
An optional label for the validation plan. If no value is provided, a label will be generated based on the current system date and time. Markdown can be used here to make the label more visually appealing (it will appear in the header area of the tabular report).
thresholds : int
| float
| bool
| tuple
| dict
| Thresholds | None = None
Generate threshold failure levels so that all validation steps can report and react accordingly when exceeding the set levels. The thresholds are set at the global level and can be overridden at the validation step level (each validation step has its own thresholds=
parameter). The default is None
, which means that no thresholds will be set. Look at the Thresholds section for information on how to set threshold levels.
actions : Actions | None = None
The actions to take when validation steps meet or exceed any set threshold levels. These actions are paired with the threshold levels and are executed during the interrogation process when there are exceedances. The actions are executed right after each step is evaluated. Such actions should be provided in the form of an Actions
object. If None
then no global actions will be set. View the Actions section for information on how to set actions.
final_actions : FinalActions | None = None
The actions to take when the validation process is complete and the final results are available. This is useful for sending notifications or reporting the overall status of the validation process. The final actions are executed after all validation steps have been processed and the results have been collected. The final actions are not tied to any threshold levels, they are executed regardless of the validation results. Such actions should be provided in the form of a FinalActions
object. If None
then no finalizing actions will be set. Please see the Actions section for information on how to set final actions.
brief : str
| bool
| None = None
A global setting for briefs, which are optional brief descriptions for validation steps (they be displayed in the reporting table). For such a global setting, templating elements like "{step}"
(to insert the step number) or "{auto}"
(to include an automatically generated brief) are useful. If True
then each brief will be automatically generated. If None
(the default) then briefs aren’t globally set.
lang : str
| None = None
The language to use for various reporting elements. By default, None
will select English ("en"
) as the but other options include French ("fr"
), German ("de"
), Italian ("it"
), Spanish ("es"
), and several more. Have a look at the Reporting Languages section for the full list of supported languages and information on how the language setting is utilized.
locale : str
| None = None
An optional locale ID to use for formatting values in the reporting table according the locale’s rules. Examples include "en-US"
for English (United States) and "fr-FR"
for French (France). More simply, this can be a language identifier without a designation of territory, like "es"
for Spanish.
: Validate
A Validate
object with the table and validations to be performed.
The data=
parameter can be given any of the following table types:
"polars"
)"pandas"
)"duckdb"
)*"mysql"
)*"postgresql"
)*"sqlite"
)*"parquet"
)*The table types marked with an asterisk need to be prepared as Ibis tables (with type of ibis.expr.types.relations.Table
). Furthermore, the use of Validate
with such tables requires the Ibis library v9.5.0 and above to be installed. If the input table is a Polars or Pandas DataFrame, the Ibis library is not required.
The thresholds=
parameter is used to set the failure-condition levels for all validation steps. They are set here at the global level but can be overridden at the validation step level (each validation step has its own local thresholds=
parameter).
There are three threshold levels: ‘warning’, ‘error’, and ‘critical’. The threshold values can either be set as a proportion failing of all test units (a value between 0
to 1
), or, the absolute number of failing test units (as integer that’s 1
or greater).
Thresholds can be defined using one of these input schemes:
Thresholds
class (the most direct way to create thresholds)0
is the ‘warning’ level, position 1
is the ‘error’ level, and position 2
is the ‘critical’ levelIf the number of failing test units for a validation step exceeds set thresholds, the validation step will be marked as ‘warning’, ‘error’, or ‘critical’. All of the threshold levels don’t need to be set, you’re free to set any combination of them.
Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the actions=
parameter).
The actions=
and final_actions=
parameters provide mechanisms to respond to validation results. These actions can be used to notify users of validation failures, log issues, or trigger other processes when problems are detected.
Step Actions
The actions=
parameter allows you to define actions that are triggered when validation steps exceed specific threshold levels (warning, error, or critical). These actions are executed during the interrogation process, right after each step is evaluated.
Step actions should be provided using the Actions
class, which lets you specify different actions for different severity levels:
# Define an action that logs a message when warning threshold is exceeded
def log_warning():
metadata = pb.get_action_metadata()
print(f"WARNING: Step {metadata['step']} failed with type {metadata['type']}")
# Define actions for different threshold levels
actions = pb.Actions(
warning = log_warning,
error = lambda: send_email("Error in validation"),
critical = "CRITICAL FAILURE DETECTED"
)
# Use in Validate
validation = pb.Validate(
data=my_data,
actions=actions # Global actions for all steps
)
You can also provide step-specific actions in individual validation methods:
validation.col_vals_gt(
columns="revenue",
value=0,
actions=pb.Actions(warning=log_warning) # Only applies to this step
)
Step actions have access to step-specific context through the get_action_metadata()
function, which provides details about the current validation step that triggered the action.
Final Actions
The final_actions=
parameter lets you define actions that execute after all validation steps have completed. These are useful for providing summaries, sending notifications based on overall validation status, or performing cleanup operations.
Final actions should be provided using the FinalActions
class:
def send_report():
summary = pb.get_validation_summary()
if summary["status"] == "CRITICAL":
send_alert_email(
subject=f"CRITICAL validation failures in {summary['table_name']}",
body=f"{summary['critical_steps']} steps failed with critical severity."
)
validation = pb.Validate(
data=my_data,
final_actions=pb.FinalActions(send_report)
)
Final actions have access to validation-wide summary information through the get_validation_summary()
function, which provides a comprehensive overview of the entire validation process.
The combination of step actions and final actions provides a flexible system for responding to data quality issues at both the individual step level and the overall validation level.
Various pieces of reporting in Pointblank can be localized to a specific language. This is done by setting the lang=
parameter in Validate
. Any of the following languages can be used (just provide the language code):
"en"
)"fr"
)"de"
)"it"
)"es"
)"pt"
)"nl"
)"sv"
)"da"
)"nb"
)"is"
)"fi"
)"pl"
)"cs"
)"ro"
)"el"
)"ru"
)"tr"
)"ar"
)"hi"
)"zh-Hans"
)"zh-Hant"
)"ja"
)"ko"
)"vi"
)Automatically generated briefs (produced by using brief=True
or brief="...{auto}..."
) will be written in the selected language. The language setting will also used when generating the validation report table through get_tabular_report()
(or printing the Validate
object in a notebook environment).
Let’s walk through a data quality analysis of an extremely small table. It’s actually called "small_table"
and it’s accessible through the load_dataset()
function.
import pointblank as pb
# Load the small_table dataset
small_table = pb.load_dataset()
# Preview the table
pb.preview(small_table)
PolarsRows13Columns8 |
||||||||
date_time Datetime |
date Date |
a Int64 |
b String |
c Int64 |
d Float64 |
e Boolean |
f String |
|
---|---|---|---|---|---|---|---|---|
1 | 2016-01-04 11:00:00 | 2016-01-04 | 2 | 1-bcd-345 | 3 | 3423.29 | True | high |
2 | 2016-01-04 00:32:00 | 2016-01-04 | 3 | 5-egh-163 | 8 | 9999.99 | True | low |
3 | 2016-01-05 13:32:00 | 2016-01-05 | 6 | 8-kdg-938 | 3 | 2343.23 | True | high |
4 | 2016-01-06 17:23:00 | 2016-01-06 | 2 | 5-jdo-903 | None | 3892.4 | False | mid |
5 | 2016-01-09 12:36:00 | 2016-01-09 | 8 | 3-ldm-038 | 7 | 283.94 | True | low |
9 | 2016-01-20 04:30:00 | 2016-01-20 | 3 | 5-bce-642 | 9 | 837.93 | False | high |
10 | 2016-01-20 04:30:00 | 2016-01-20 | 3 | 5-bce-642 | 9 | 837.93 | False | high |
11 | 2016-01-26 20:07:00 | 2016-01-26 | 4 | 2-dmx-010 | 7 | 833.98 | True | low |
12 | 2016-01-28 02:51:00 | 2016-01-28 | 2 | 7-dmx-010 | 8 | 108.34 | False | low |
13 | 2016-01-30 11:23:00 | 2016-01-30 | 1 | 3-dka-303 | None | 2230.09 | True | high |
We ought to think about what’s tolerable in terms of data quality so let’s designate proportional failure thresholds to the ‘warning’, ‘error’, and ‘critical’ states. This can be done by using the Thresholds
class.
Now, we use the Validate
class and give it the thresholds
object (which serves as a default for all validation steps but can be overridden). The static thresholds provided in thresholds=
will make the reporting a bit more useful. We also need to provide a target table and we’ll use small_table
for this.
Then, as with any Validate
object, we can add steps to the validation plan by using as many validation methods as we want. To conclude the process (and actually query the data table), we use the interrogate()
method.
The validation
object can be printed as a reporting table.
The report could be further customized by using the get_tabular_report()
method, which contains options for modifying the display of the table.
Briefs are short descriptions of the validation steps. While they can be set for each step individually, they can also be set globally. The global setting is done by using the brief=
argument in Validate
. The global setting can be as simple as True
to have automatically-generated briefs for each step. Alternatively, we can use templating elements like "{step}"
(to insert the step number) or "{auto}"
(to include an automatically generated brief). Here’s an example of a global setting for briefs:
validation = (
pb.Validate(
data=pb.load_dataset(),
tbl_name="small_table",
label="Validation example with briefs",
brief="Step {step}: {auto}",
)
.col_vals_gt(columns="d", value=100)
.col_vals_between(columns="c", left=3, right=10, na_pass=True)
.col_vals_regex(
columns="b",
pattern=r"[0-9]-[a-z]{3}-[0-9]{3}",
brief="Regex check for column {col}"
)
.interrogate()
)
validation
We see the text of the briefs appear in the STEP
column of the reporting table. Furthermore, the global brief’s template ("Step {step}: {auto}"
) is applied to all steps except for the final step, where the step-level brief=
argument provided an override.
If you should want to cancel the globally-defined brief for one or more validation steps, you can set brief=False
in those particular steps.
The Validate
class has a number of post-interrogation methods that can be used to extract useful information from the validation results. For example, the get_data_extracts()
method can be used to get the data extracts for each validation step.
{1: shape: (0, 9)
┌───────────┬──────────────┬──────┬─────┬───┬─────┬─────┬──────┬─────┐
│ _row_num_ ┆ date_time ┆ date ┆ a ┆ … ┆ c ┆ d ┆ e ┆ f │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ datetime[μs] ┆ date ┆ i64 ┆ ┆ i64 ┆ f64 ┆ bool ┆ str │
╞═══════════╪══════════════╪══════╪═════╪═══╪═════╪═════╪══════╪═════╡
└───────────┴──────────────┴──────┴─────┴───┴─────┴─────┴──────┴─────┘,
2: shape: (1, 9)
┌───────────┬─────────────────────┬────────────┬─────┬───┬─────┬─────────┬───────┬─────┐
│ _row_num_ ┆ date_time ┆ date ┆ a ┆ … ┆ c ┆ d ┆ e ┆ f │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ datetime[μs] ┆ date ┆ i64 ┆ ┆ i64 ┆ f64 ┆ bool ┆ str │
╞═══════════╪═════════════════════╪════════════╪═════╪═══╪═════╪═════════╪═══════╪═════╡
│ 8 ┆ 2016-01-17 11:27:00 ┆ 2016-01-17 ┆ 4 ┆ … ┆ 2 ┆ 1035.64 ┆ false ┆ low │
└───────────┴─────────────────────┴────────────┴─────┴───┴─────┴─────────┴───────┴─────┘,
3: shape: (0, 9)
┌───────────┬──────────────┬──────┬─────┬───┬─────┬─────┬──────┬─────┐
│ _row_num_ ┆ date_time ┆ date ┆ a ┆ … ┆ c ┆ d ┆ e ┆ f │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ datetime[μs] ┆ date ┆ i64 ┆ ┆ i64 ┆ f64 ┆ bool ┆ str │
╞═══════════╪══════════════╪══════╪═════╪═══╪═════╪═════╪══════╪═════╡
└───────────┴──────────────┴──────┴─────┴───┴─────┴─────┴──────┴─────┘}
We can also view step reports for each validation step using the get_step_report()
method. This method adapts to the type of validation step and shows the relevant information for a step’s validation.
Report for Validation Step 2 ASSERTION
1 / 13 TEST UNIT FAILURES IN COLUMN 5 EXTRACT OF ALL 1 ROWS (WITH TEST UNIT FAILURES IN RED): |
||||||||
date_time Datetime |
date Date |
a Int64 |
b String |
c Int64 |
d Float64 |
e Boolean |
f String |
|
---|---|---|---|---|---|---|---|---|
8 | 2016-01-17 11:27:00 | 2016-01-17 | 4 | 5-boe-639 | 2 | 1035.64 | False | low |
The Validate
class also has a method for getting the sundered data, which is the data that passed or failed the validation steps. This can be done using the get_sundered_data()
method.
PolarsRows12Columns8 |
||||||||
date_time Datetime |
date Date |
a Int64 |
b String |
c Int64 |
d Float64 |
e Boolean |
f String |
|
---|---|---|---|---|---|---|---|---|
1 | 2016-01-04 11:00:00 | 2016-01-04 | 2 | 1-bcd-345 | 3 | 3423.29 | True | high |
2 | 2016-01-04 00:32:00 | 2016-01-04 | 3 | 5-egh-163 | 8 | 9999.99 | True | low |
3 | 2016-01-05 13:32:00 | 2016-01-05 | 6 | 8-kdg-938 | 3 | 2343.23 | True | high |
4 | 2016-01-06 17:23:00 | 2016-01-06 | 2 | 5-jdo-903 | None | 3892.4 | False | mid |
5 | 2016-01-09 12:36:00 | 2016-01-09 | 8 | 3-ldm-038 | 7 | 283.94 | True | low |
8 | 2016-01-20 04:30:00 | 2016-01-20 | 3 | 5-bce-642 | 9 | 837.93 | False | high |
9 | 2016-01-20 04:30:00 | 2016-01-20 | 3 | 5-bce-642 | 9 | 837.93 | False | high |
10 | 2016-01-26 20:07:00 | 2016-01-26 | 4 | 2-dmx-010 | 7 | 833.98 | True | low |
11 | 2016-01-28 02:51:00 | 2016-01-28 | 2 | 7-dmx-010 | 8 | 108.34 | False | low |
12 | 2016-01-30 11:23:00 | 2016-01-30 | 1 | 3-dka-303 | None | 2230.09 | True | high |
The sundered data is a DataFrame that contains the rows that passed or failed the validation. The default behavior is to return the rows that failed the validation, as shown above.