API Reference

Validate

Data validation starts with the Validate class. You pass it the target table and can optionally provide metadata and failure thresholds (using the Thresholds class or one of its shorthand forms). The Validate class has numerous methods for defining validation steps and for obtaining post-interrogation metrics and data.

Validate: Workflow for defining a set of validations on a table and interrogating for results.
Thresholds: Definition of threshold values.
Actions: Definition of action values.
FinalActions: Define actions to be taken after validation is complete.
Schema: Definition of a schema object.
DraftValidation: Draft a validation plan for a given table using an LLM.

Validation Steps

Validation steps can be thought of as sequential validations on the target data. We call Validate’s validation methods to build up a validation plan: a collection of steps that, in the aggregate, provides good validation coverage.

Validate.col_vals_gt(): Are column data greater than a fixed value or data in another column?
Validate.col_vals_lt(): Are column data less than a fixed value or data in another column?
Validate.col_vals_ge(): Are column data greater than or equal to a fixed value or data in another column?
Validate.col_vals_le(): Are column data less than or equal to a fixed value or data in another column?
Validate.col_vals_eq(): Are column data equal to a fixed value or data in another column?
Validate.col_vals_ne(): Are column data not equal to a fixed value or data in another column?
Validate.col_vals_between(): Do column data lie between two specified values or data in other columns?
Validate.col_vals_outside(): Do column data lie outside of two specified values or data in other columns?
Validate.col_vals_in_set(): Validate whether column values are in a set of values.
Validate.col_vals_not_in_set(): Validate whether column values are not in a set of values.
Validate.col_vals_increasing(): Are column data increasing by row?
Validate.col_vals_decreasing(): Are column data decreasing by row?
Validate.col_vals_null(): Validate whether values in a column are Null.
Validate.col_vals_not_null(): Validate whether values in a column are not Null.
Validate.col_vals_regex(): Validate whether column values match a regular expression pattern.
Validate.col_vals_within_spec(): Validate whether column values fit within a specification.
Validate.col_vals_expr(): Validate column values using a custom expression.
Validate.rows_distinct(): Validate whether rows in the table are distinct.
Validate.rows_complete(): Validate whether row data are complete by having no missing values.
Validate.col_exists(): Validate whether one or more columns exist in the table.
Validate.col_pct_null(): Validate whether a column has a specific percentage of Null values.
Validate.col_schema_match(): Do columns in the table (and their types) match a predefined schema?
Validate.row_count_match(): Validate whether the row count of the table matches a specified count.
Validate.col_count_match(): Validate whether the column count of the table matches a specified count.
Validate.tbl_match(): Validate whether the target table matches a comparison table.
Validate.conjointly(): Perform multiple row-wise validations for joint validity.
Validate.specially(): Perform a specialized validation with customized logic.
Validate.prompt(): Validate rows using AI/LLM-powered analysis.

Column Selection

A flexible way to select columns for validation is to use the col() function along with column selection helper functions. A combination of col() + starts_with(), matches(), etc., allows for the selection of multiple target columns (mapping a validation across many steps). Furthermore, the col() function can be used to declare a comparison column (e.g., for the value= argument in many col_vals_*() methods) when you can’t use a fixed value for comparison.

col(): Helper function for referencing a column in the input table.
starts_with(): Select columns that start with specified text.
ends_with(): Select columns that end with specified text.
contains(): Select columns that contain specified text.
matches(): Select columns that match a specified regular expression pattern.
everything(): Select all columns.
first_n(): Select the first n columns in the column list.
last_n(): Select the last n columns in the column list.
expr_col(): Create a column expression for use in conjointly() validation.

Segment Groups

Combine multiple values into a single segment using seg_*() helper functions.

seg_group(): Group together values for segmentation.

Interrogation and Reporting

The validation plan is put into action when interrogate() is called. The workflow for performing a comprehensive validation is then: (1) create a Validate object, (2) add validation steps, (3) call interrogate(). After interrogation of the data, we can view a validation report table (by printing the object or using get_tabular_report()), extract key metrics, or split the data based on the validation results (with get_sundered_data()).

Validate.interrogate(): Execute each validation step against the table and store the results.
Validate.set_tbl(): Set or replace the table associated with the Validate object.
Validate.get_tabular_report(): Validation report as a GT table.
Validate.get_step_report(): Get a detailed report for a single validation step.
Validate.get_json_report(): Get a report of the validation results as a JSON-formatted string.
Validate.get_sundered_data(): Get the data that passed or failed the validation steps.
Validate.get_data_extracts(): Get the rows that failed for each validation step.
Validate.all_passed(): Determine if every validation step passed perfectly, with no failing test units.
Validate.assert_passing(): Raise an AssertionError unless every test is passing.
Validate.assert_below_threshold(): Raise an AssertionError if validation steps exceed a specified threshold level.
Validate.above_threshold(): Check if any validation steps exceed a specified threshold level.
Validate.n(): Provides a dictionary of the number of test units for each validation step.
Validate.n_passed(): Provides a dictionary of the number of test units that passed for each validation step.
Validate.n_failed(): Provides a dictionary of the number of test units that failed for each validation step.
Validate.f_passed(): Provides a dictionary of the fraction of test units that passed for each validation step.
Validate.f_failed(): Provides a dictionary of the fraction of test units that failed for each validation step.
Validate.warning(): Get the ‘warning’ level status for each validation step.
Validate.error(): Get the ‘error’ level status for each validation step.
Validate.critical(): Get the ‘critical’ level status for each validation step.

Inspection and Assistance

The Inspection and Assistance group contains functions that are helpful for getting to grips with a new data table. Use the DataScan class to get a quick overview of the data, preview() to see the first and last few rows of a table, col_summary_tbl() for a column-level summary of a table, and missing_vals_tbl() to see where there are missing values in a table. Several datasets included in the package can be accessed via the load_dataset() function. On the assistance side, the assistant() function can be used to get help with Pointblank.

DataScan: Get a summary of a dataset.
preview(): Display a table preview that shows some rows from the top, some from the bottom.
col_summary_tbl(): Generate a column-level summary table of a dataset.
missing_vals_tbl(): Display a table that shows the missing values in the input table.
assistant(): Chat with the PbA (Pointblank Assistant) about your data validation needs.
load_dataset(): Load a dataset hosted in the library as a specified table type.
get_data_path(): Get the file path to a dataset included with the Pointblank package.
connect_to_table(): Connect to a database table using a connection string.
print_database_tables(): List all tables in a database from a connection string.

YAML

The YAML group contains functions that allow for the use of YAML to orchestrate validation workflows. The yaml_interrogate() function can be used to run a validation workflow from YAML strings or files. The validate_yaml() function checks if the YAML configuration passes its own validity checks. The yaml_to_python() function converts YAML configuration to equivalent Python code.

yaml_interrogate(): Execute a YAML-based validation workflow.
validate_yaml(): Validate YAML configuration against the expected structure.
yaml_to_python(): Convert YAML validation configuration to equivalent Python code.

Utility Functions

The Utility Functions group contains functions that are useful for accessing metadata about the target data. Use get_column_count() or get_row_count() to get the number of columns or rows in a table. The get_action_metadata() function is useful when building custom actions since it returns metadata about the validation step that’s triggering the action. Lastly, the config() utility lets us set global configuration parameters.

get_column_count(): Get the number of columns in a table.
get_row_count(): Get the number of rows in a table.
get_action_metadata(): Access step-level metadata when authoring custom actions.
get_validation_summary(): Access validation summary information when authoring final actions.
write_file(): Write a Validate object to disk as a serialized file.
read_file(): Read a Validate object from disk that was previously saved with write_file().
config(): Configuration settings for the Pointblank library.

Prebuilt Actions

The Prebuilt Actions group contains a function that can be used to send a Slack notification, either when validation steps exceed failure threshold levels or simply to summarize the validation results. The summary includes the status, number of steps, passing and failing steps, table information, and timing details.

send_slack_notification(): Create a Slack notification function using a webhook URL.