Validate

Validate(
    self,
    data,
    tbl_name=None,
    label=None,
    thresholds=None,
    actions=None,
    final_actions=None,
    brief=None,
    lang=None,
    locale=None,
)

Workflow for defining a set of validations on a table and interrogating for results.

The Validate class is used for defining a set of validation steps on a table and interrogating the table with the validation plan. This class is the main entry point for the data quality reporting workflow. The overall aim of this workflow is to generate comprehensive reporting information to assess the level of data quality for a target table.

We can supply as many validation steps as needed, and having a large number of them should increase the validation coverage for a given table. The validation methods (e.g., col_vals_gt(), col_vals_between(), etc.) translate to discrete validation steps, where each step will be sequentially numbered (useful when viewing the reporting data). This process of calling validation methods is known as developing a validation plan.

The validation methods, when called, are merely instructions up to the point the concluding interrogate() method is called. That kicks off the process of acting on the validation plan by querying the target table getting reporting results for each step. Once the interrogation process is complete, we can say that the workflow now has reporting information. We can then extract useful information from the reporting data to understand the quality of the table. Printing the Validate object (or using the get_tabular_report() method) will return a table with the results of the interrogation and get_sundered_data() allows for the splitting of the table based on passing and failing rows.

Parameters

data : FrameT | Any

The table to validate, which could be a DataFrame object or an Ibis table object. Read the Supported Input Table Types section for details on the supported table types.

tbl_name : str | None = None

A optional name to assign to the input table object. If no value is provided, a name will be generated based on whatever information is available. This table name will be displayed in the header area of the tabular report.

label : str | None = None

An optional label for the validation plan. If no value is provided, a label will be generated based on the current system date and time. Markdown can be used here to make the label more visually appealing (it will appear in the header area of the tabular report).

thresholds : int | float | bool | tuple | dict | Thresholds | None = None

Generate threshold failure levels so that all validation steps can report and react accordingly when exceeding the set levels. The thresholds are set at the global level and can be overridden at the validation step level (each validation step has its own thresholds= parameter). The default is None, which means that no thresholds will be set. Look at the Thresholds section for information on how to set threshold levels.

actions : Actions | None = None

The actions to take when validation steps meet or exceed any set threshold levels. These actions are paired with the threshold levels and are executed during the interrogation process when there are exceedances. The actions are executed right after each step is evaluated. Such actions should be provided in the form of an Actions object. If None then no global actions will be set. View the Actions section for information on how to set actions.

final_actions : FinalActions | None = None

The actions to take when the validation process is complete and the final results are available. This is useful for sending notifications or reporting the overall status of the validation process. The final actions are executed after all validation steps have been processed and the results have been collected. The final actions are not tied to any threshold levels, they are executed regardless of the validation results. Such actions should be provided in the form of a FinalActions object. If None then no finalizing actions will be set. Please see the Actions section for information on how to set final actions.

brief : str | bool | None = None

A global setting for briefs, which are optional brief descriptions for validation steps (they be displayed in the reporting table). For such a global setting, templating elements like "{step}" (to insert the step number) or "{auto}" (to include an automatically generated brief) are useful. If True then each brief will be automatically generated. If None (the default) then briefs aren’t globally set.

lang : str | None = None

The language to use for various reporting elements. By default, None will select English ("en") as the but other options include French ("fr"), German ("de"), Italian ("it"), Spanish ("es"), and several more. Have a look at the Reporting Languages section for the full list of supported languages and information on how the language setting is utilized.

locale : str | None = None

An optional locale ID to use for formatting values in the reporting table according the locale’s rules. Examples include "en-US" for English (United States) and "fr-FR" for French (France). More simply, this can be a language identifier without a designation of territory, like "es" for Spanish.

Returns

: Validate

A Validate object with the table and validations to be performed.

Supported Input Table Types

The data= parameter can be given any of the following table types:

  • Polars DataFrame ("polars")
  • Pandas DataFrame ("pandas")
  • DuckDB table ("duckdb")*
  • MySQL table ("mysql")*
  • PostgreSQL table ("postgresql")*
  • SQLite table ("sqlite")*
  • Parquet table ("parquet")*

The table types marked with an asterisk need to be prepared as Ibis tables (with type of ibis.expr.types.relations.Table). Furthermore, the use of Validate with such tables requires the Ibis library v9.5.0 and above to be installed. If the input table is a Polars or Pandas DataFrame, the Ibis library is not required.

Thresholds

The thresholds= parameter is used to set the failure-condition levels for all validation steps. They are set here at the global level but can be overridden at the validation step level (each validation step has its own local thresholds= parameter).

There are three threshold levels: ‘warning’, ‘error’, and ‘critical’. The threshold values can either be set as a proportion failing of all test units (a value between 0 to 1), or, the absolute number of failing test units (as integer that’s 1 or greater).

Thresholds can be defined using one of these input schemes:

  1. use the Thresholds class (the most direct way to create thresholds)
  2. provide a tuple of 1-3 values, where position 0 is the ‘warning’ level, position 1 is the ‘error’ level, and position 2 is the ‘critical’ level
  3. create a dictionary of 1-3 value entries; the valid keys: are ‘warning’, ‘error’, and ‘critical’
  4. a single integer/float value denoting absolute number or fraction of failing test units for the ‘warning’ level only

If the number of failing test units for a validation step exceeds set thresholds, the validation step will be marked as ‘warning’, ‘error’, or ‘critical’. All of the threshold levels don’t need to be set, you’re free to set any combination of them.

Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the actions= parameter).

Actions

The actions= and final_actions= parameters provide mechanisms to respond to validation results. These actions can be used to notify users of validation failures, log issues, or trigger other processes when problems are detected.

Step Actions

The actions= parameter allows you to define actions that are triggered when validation steps exceed specific threshold levels (warning, error, or critical). These actions are executed during the interrogation process, right after each step is evaluated.

Step actions should be provided using the Actions class, which lets you specify different actions for different severity levels:

# Define an action that logs a message when warning threshold is exceeded
def log_warning():
    metadata = pb.get_action_metadata()
    print(f"WARNING: Step {metadata['step']} failed with type {metadata['type']}")

# Define actions for different threshold levels
actions = pb.Actions(
    warning = log_warning,
    error = lambda: send_email("Error in validation"),
    critical = "CRITICAL FAILURE DETECTED"
)

# Use in Validate
validation = pb.Validate(
    data=my_data,
    actions=actions  # Global actions for all steps
)

You can also provide step-specific actions in individual validation methods:

validation.col_vals_gt(
    columns="revenue",
    value=0,
    actions=pb.Actions(warning=log_warning)  # Only applies to this step
)

Step actions have access to step-specific context through the get_action_metadata() function, which provides details about the current validation step that triggered the action.

Final Actions

The final_actions= parameter lets you define actions that execute after all validation steps have completed. These are useful for providing summaries, sending notifications based on overall validation status, or performing cleanup operations.

Final actions should be provided using the FinalActions class:

def send_report():
    summary = pb.get_validation_summary()
    if summary["status"] == "CRITICAL":
        send_alert_email(
            subject=f"CRITICAL validation failures in {summary['table_name']}",
            body=f"{summary['critical_steps']} steps failed with critical severity."
        )

validation = pb.Validate(
    data=my_data,
    final_actions=pb.FinalActions(send_report)
)

Final actions have access to validation-wide summary information through the get_validation_summary() function, which provides a comprehensive overview of the entire validation process.

The combination of step actions and final actions provides a flexible system for responding to data quality issues at both the individual step level and the overall validation level.

Reporting Languages

Various pieces of reporting in Pointblank can be localized to a specific language. This is done by setting the lang= parameter in Validate. Any of the following languages can be used (just provide the language code):

  • English ("en")
  • French ("fr")
  • German ("de")
  • Italian ("it")
  • Spanish ("es")
  • Portuguese ("pt")
  • Dutch ("nl")
  • Swedish ("sv")
  • Danish ("da")
  • Norwegian Bokmål ("nb")
  • Icelandic ("is")
  • Finnish ("fi")
  • Polish ("pl")
  • Czech ("cs")
  • Romanian ("ro")
  • Greek ("el")
  • Russian ("ru")
  • Turkish ("tr")
  • Arabic ("ar")
  • Hindi ("hi")
  • Simplified Chinese ("zh-Hans")
  • Traditional Chinese ("zh-Hant")
  • Japanese ("ja")
  • Korean ("ko")
  • Vietnamese ("vi")

Automatically generated briefs (produced by using brief=True or brief="...{auto}...") will be written in the selected language. The language setting will also used when generating the validation report table through get_tabular_report() (or printing the Validate object in a notebook environment).

Examples

Creating a validation plan and interrogating

Let’s walk through a data quality analysis of an extremely small table. It’s actually called "small_table" and it’s accessible through the load_dataset() function.

import pointblank as pb

# Load the small_table dataset
small_table = pb.load_dataset()

# Preview the table
pb.preview(small_table)
PolarsRows13Columns8
date_time
Datetime
date
Date
a
Int64
b
String
c
Int64
d
Float64
e
Boolean
f
String
1 2016-01-04 11:00:00 2016-01-04 2 1-bcd-345 3 3423.29 True high
2 2016-01-04 00:32:00 2016-01-04 3 5-egh-163 8 9999.99 True low
3 2016-01-05 13:32:00 2016-01-05 6 8-kdg-938 3 2343.23 True high
4 2016-01-06 17:23:00 2016-01-06 2 5-jdo-903 None 3892.4 False mid
5 2016-01-09 12:36:00 2016-01-09 8 3-ldm-038 7 283.94 True low
9 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 837.93 False high
10 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 837.93 False high
11 2016-01-26 20:07:00 2016-01-26 4 2-dmx-010 7 833.98 True low
12 2016-01-28 02:51:00 2016-01-28 2 7-dmx-010 8 108.34 False low
13 2016-01-30 11:23:00 2016-01-30 1 3-dka-303 None 2230.09 True high

We ought to think about what’s tolerable in terms of data quality so let’s designate proportional failure thresholds to the ‘warning’, ‘error’, and ‘critical’ states. This can be done by using the Thresholds class.

thresholds = pb.Thresholds(warning=0.10, error=0.25, critical=0.35)

Now, we use the Validate class and give it the thresholds object (which serves as a default for all validation steps but can be overridden). The static thresholds provided in thresholds= will make the reporting a bit more useful. We also need to provide a target table and we’ll use small_table for this.

validation = (
    pb.Validate(
        data=small_table,
        tbl_name="small_table",
        label="`Validate` example.",
        thresholds=thresholds
    )
)

Then, as with any Validate object, we can add steps to the validation plan by using as many validation methods as we want. To conclude the process (and actually query the data table), we use the interrogate() method.

validation = (
    validation
    .col_vals_gt(columns="d", value=100)
    .col_vals_le(columns="c", value=5)
    .col_vals_between(columns="c", left=3, right=10, na_pass=True)
    .col_vals_regex(columns="b", pattern=r"[0-9]-[a-z]{3}-[0-9]{3}")
    .col_exists(columns=["date", "date_time"])
    .interrogate()
)

The validation object can be printed as a reporting table.

validation
Pointblank Validation
`Validate` example.
Polarssmall_tableWARNING0.1ERROR0.25CRITICAL0.35
STEP COLUMNS VALUES TBL EVAL UNITS PASS FAIL W E C EXT
#4CA64C 1
col_vals_gt
col_vals_gt()
d 100 13 13
1.00
0
0.00
#FF3300 2
col_vals_lte
col_vals_le()
c 5 13 5
0.38
8
0.62
#4CA64C66 3
col_vals_between
col_vals_between()
c [3, 10] 13 12
0.92
1
0.08
#4CA64C 4
col_vals_regex
col_vals_regex()
b [0-9]-[a-z]{3}-[0-9]{3} 13 13
1.00
0
0.00
#4CA64C 5
col_exists
col_exists()
date 1 1
1.00
0
0.00
#4CA64C 6
col_exists
col_exists()
date_time 1 1
1.00
0
0.00
2025-04-15 14:44:03 UTC< 1 s2025-04-15 14:44:03 UTC

The report could be further customized by using the get_tabular_report() method, which contains options for modifying the display of the table.

Adding briefs

Briefs are short descriptions of the validation steps. While they can be set for each step individually, they can also be set globally. The global setting is done by using the brief= argument in Validate. The global setting can be as simple as True to have automatically-generated briefs for each step. Alternatively, we can use templating elements like "{step}" (to insert the step number) or "{auto}" (to include an automatically generated brief). Here’s an example of a global setting for briefs:

validation = (
    pb.Validate(
        data=pb.load_dataset(),
        tbl_name="small_table",
        label="Validation example with briefs",
        brief="Step {step}: {auto}",
    )
    .col_vals_gt(columns="d", value=100)
    .col_vals_between(columns="c", left=3, right=10, na_pass=True)
    .col_vals_regex(
        columns="b",
        pattern=r"[0-9]-[a-z]{3}-[0-9]{3}",
        brief="Regex check for column {col}"
    )
    .interrogate()
)

validation
Pointblank Validation
Validation example with briefs
Polarssmall_table
STEP COLUMNS VALUES TBL EVAL UNITS PASS FAIL W E C EXT
#4CA64C 1
col_vals_gt
col_vals_gt()

Step 1: Expect that values in d should be > 100.

d 100 13 13
1.00
0
0.00
#4CA64C66 2
col_vals_between
col_vals_between()

Step 2: Expect that values in c should be between 3 and 10.

c [3, 10] 13 12
0.92
1
0.08
#4CA64C 3
col_vals_regex
col_vals_regex()

Regex check for column b

b [0-9]-[a-z]{3}-[0-9]{3} 13 13
1.00
0
0.00
2025-04-15 14:44:03 UTC< 1 s2025-04-15 14:44:03 UTC

We see the text of the briefs appear in the STEP column of the reporting table. Furthermore, the global brief’s template ("Step {step}: {auto}") is applied to all steps except for the final step, where the step-level brief= argument provided an override.

If you should want to cancel the globally-defined brief for one or more validation steps, you can set brief=False in those particular steps.

Post-interrogation methods

The Validate class has a number of post-interrogation methods that can be used to extract useful information from the validation results. For example, the get_data_extracts() method can be used to get the data extracts for each validation step.

validation.get_data_extracts()
{1: shape: (0, 9)
 ┌───────────┬──────────────┬──────┬─────┬───┬─────┬─────┬──────┬─────┐
 │ _row_num_ ┆ date_time    ┆ date ┆ a   ┆ … ┆ c   ┆ d   ┆ e    ┆ f   │
 │ ---       ┆ ---          ┆ ---  ┆ --- ┆   ┆ --- ┆ --- ┆ ---  ┆ --- │
 │ u32       ┆ datetime[μs] ┆ date ┆ i64 ┆   ┆ i64 ┆ f64 ┆ bool ┆ str │
 ╞═══════════╪══════════════╪══════╪═════╪═══╪═════╪═════╪══════╪═════╡
 └───────────┴──────────────┴──────┴─────┴───┴─────┴─────┴──────┴─────┘,
 2: shape: (1, 9)
 ┌───────────┬─────────────────────┬────────────┬─────┬───┬─────┬─────────┬───────┬─────┐
 │ _row_num_ ┆ date_time           ┆ date       ┆ a   ┆ … ┆ c   ┆ d       ┆ e     ┆ f   │
 │ ---       ┆ ---                 ┆ ---        ┆ --- ┆   ┆ --- ┆ ---     ┆ ---   ┆ --- │
 │ u32       ┆ datetime[μs]        ┆ date       ┆ i64 ┆   ┆ i64 ┆ f64     ┆ bool  ┆ str │
 ╞═══════════╪═════════════════════╪════════════╪═════╪═══╪═════╪═════════╪═══════╪═════╡
 │ 8         ┆ 2016-01-17 11:27:00 ┆ 2016-01-17 ┆ 4   ┆ … ┆ 2   ┆ 1035.64 ┆ false ┆ low │
 └───────────┴─────────────────────┴────────────┴─────┴───┴─────┴─────────┴───────┴─────┘,
 3: shape: (0, 9)
 ┌───────────┬──────────────┬──────┬─────┬───┬─────┬─────┬──────┬─────┐
 │ _row_num_ ┆ date_time    ┆ date ┆ a   ┆ … ┆ c   ┆ d   ┆ e    ┆ f   │
 │ ---       ┆ ---          ┆ ---  ┆ --- ┆   ┆ --- ┆ --- ┆ ---  ┆ --- │
 │ u32       ┆ datetime[μs] ┆ date ┆ i64 ┆   ┆ i64 ┆ f64 ┆ bool ┆ str │
 ╞═══════════╪══════════════╪══════╪═════╪═══╪═════╪═════╪══════╪═════╡
 └───────────┴──────────────┴──────┴─────┴───┴─────┴─────┴──────┴─────┘}

We can also view step reports for each validation step using the get_step_report() method. This method adapts to the type of validation step and shows the relevant information for a step’s validation.

validation.get_step_report(i=2)
Report for Validation Step 2
ASSERTION 3 ≤ c ≤ 10
1 / 13 TEST UNIT FAILURES IN COLUMN 5
EXTRACT OF ALL 1 ROWS (WITH TEST UNIT FAILURES IN RED):
date_time
Datetime
date
Date
a
Int64
b
String
c
Int64
d
Float64
e
Boolean
f
String
8 2016-01-17 11:27:00 2016-01-17 4 5-boe-639 2 1035.64 False low

The Validate class also has a method for getting the sundered data, which is the data that passed or failed the validation steps. This can be done using the get_sundered_data() method.

pb.preview(validation.get_sundered_data())
PolarsRows12Columns8
date_time
Datetime
date
Date
a
Int64
b
String
c
Int64
d
Float64
e
Boolean
f
String
1 2016-01-04 11:00:00 2016-01-04 2 1-bcd-345 3 3423.29 True high
2 2016-01-04 00:32:00 2016-01-04 3 5-egh-163 8 9999.99 True low
3 2016-01-05 13:32:00 2016-01-05 6 8-kdg-938 3 2343.23 True high
4 2016-01-06 17:23:00 2016-01-06 2 5-jdo-903 None 3892.4 False mid
5 2016-01-09 12:36:00 2016-01-09 8 3-ldm-038 7 283.94 True low
8 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 837.93 False high
9 2016-01-20 04:30:00 2016-01-20 3 5-bce-642 9 837.93 False high
10 2016-01-26 20:07:00 2016-01-26 4 2-dmx-010 7 833.98 True low
11 2016-01-28 02:51:00 2016-01-28 2 7-dmx-010 8 108.34 False low
12 2016-01-30 11:23:00 2016-01-30 1 3-dka-303 None 2230.09 True high

The sundered data is a DataFrame that contains the rows that passed or failed the validation. The default behavior is to return the rows that failed the validation, as shown above.