API Reference

Validate

When performing data validation, you’ll need the Validate class to get the process started. It takes the target table and, optionally, some metadata and failure thresholds (set with the Thresholds class or through shorthand forms). The Validate class has numerous methods for defining validation steps and for obtaining post-interrogation metrics and data.

Validate Workflow for defining a set of validations on a table and interrogating for results.
Thresholds Definition of threshold values.
Actions Definition of action values.
Schema Definition of a schema object.
DraftValidation Draft a validation plan for a given table using an LLM.

Validation Steps

Validation steps can be thought of as sequential validations on the target data. We call Validate’s validation methods to build up a validation plan: a collection of steps that, in the aggregate, provides good validation coverage.

Validate.col_vals_gt Are column data greater than a fixed value or data in another column?
Validate.col_vals_lt Are column data less than a fixed value or data in another column?
Validate.col_vals_ge Are column data greater than or equal to a fixed value or data in another column?
Validate.col_vals_le Are column data less than or equal to a fixed value or data in another column?
Validate.col_vals_eq Are column data equal to a fixed value or data in another column?
Validate.col_vals_ne Are column data not equal to a fixed value or data in another column?
Validate.col_vals_between Do column data lie between two specified values or data in other columns?
Validate.col_vals_outside Do column data lie outside of two specified values or data in other columns?
Validate.col_vals_in_set Validate whether column values are in a set of values.
Validate.col_vals_not_in_set Validate whether column values are not in a set of values.
Validate.col_vals_null Validate whether values in a column are NULL.
Validate.col_vals_not_null Validate whether values in a column are not NULL.
Validate.col_vals_regex Validate whether column values match a regular expression pattern.
Validate.col_vals_expr Validate column values using a custom expression.
Validate.col_exists Validate whether one or more columns exist in the table.
Validate.rows_distinct Validate whether rows in the table are distinct.
Validate.col_schema_match Do columns in the table (and their types) match a predefined schema?
Validate.row_count_match Validate whether the row count of the table matches a specified count.
Validate.col_count_match Validate whether the column count of the table matches a specified count.
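
To make the idea of a validation plan concrete, here is a minimal plain-Python sketch of steps that are collected in order and then evaluated row by row. It does not call the library: the Plan and Step classes and the run() method are hypothetical stand-ins, and the two step methods only borrow their names from the list above, with simplified signatures that are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

Row = dict[str, Any]


@dataclass
class Step:
    """One validation step: a label plus a check applied to each row."""
    label: str
    check: Callable[[Row], bool]


@dataclass
class Plan:
    """Hypothetical stand-in for a validation plan (not the library's Validate class)."""
    rows: list[Row]
    steps: list[Step] = field(default_factory=list)

    def _add(self, label: str, check: Callable[[Row], bool]) -> "Plan":
        self.steps.append(Step(label, check))
        return self  # returning self lets step definitions be chained

    def col_vals_gt(self, column: str, value: float) -> "Plan":
        return self._add(f"{column} > {value}", lambda r: r[column] > value)

    def col_vals_not_null(self, column: str) -> "Plan":
        return self._add(f"{column} is not null", lambda r: r[column] is not None)

    def run(self) -> list[tuple[str, list[bool]]]:
        # Each step yields one pass/fail result per row, in step order.
        return [(s.label, [s.check(r) for r in self.rows]) for s in self.steps]


rows = [
    {"a": 5, "b": 2},
    {"a": 1, "b": 7},
    {"a": 9, "b": None},
]

plan = Plan(rows).col_vals_gt("a", 2).col_vals_not_null("b")
results = plan.run()

for label, units in results:
    print(label, units)
```

Each chained call only records a step; nothing is checked until run() is called, which mirrors the separation between defining steps and interrogating the data.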

Column Selection

A flexible way to select columns for validation is to use the col() function along with column selection helper functions. A combination of col() + starts_with(), matches(), etc., allows for the selection of multiple target columns (mapping a validation across many steps). Furthermore, the col() function can be used to declare a comparison column (e.g., for the value= argument in many col_vals_*() methods) when you can’t use a fixed value for comparison.

col Helper function for referencing a column in the input table.
starts_with Select columns that start with specified text.
ends_with Select columns that end with specified text.
contains Select columns that contain specified text.
matches Select columns that match a specified regular expression pattern.
everything Select all columns.
first_n Select the first n columns in the column list.
last_n Select the last n columns in the column list.
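
The way a selector expands to several target columns can be illustrated with a short plain-Python sketch. The select_* functions below are hypothetical and are not the library's helpers; they assume case-sensitive matching and a regular-expression search anywhere in the name, which may differ from the library's defaults.

```python
import re

columns = ["id", "date_start", "date_end", "amount_usd", "amount_eur", "notes"]


def select_starts_with(cols: list[str], text: str) -> list[str]:
    return [c for c in cols if c.startswith(text)]


def select_ends_with(cols: list[str], text: str) -> list[str]:
    return [c for c in cols if c.endswith(text)]


def select_contains(cols: list[str], text: str) -> list[str]:
    return [c for c in cols if text in c]


def select_matches(cols: list[str], pattern: str) -> list[str]:
    return [c for c in cols if re.search(pattern, c)]


def select_first_n(cols: list[str], n: int) -> list[str]:
    return cols[:n]


def select_last_n(cols: list[str], n: int) -> list[str]:
    # Slice from an explicit start index so that n == 0 yields an empty list.
    return cols[len(cols) - n:]


# One selector resolves to several columns, and each resolved column
# becomes its own step description.
targets = select_starts_with(columns, "amount")
steps = [f"col_vals_gt({c!r}, value=0)" for c in targets]
print(steps)
```

The final list shows the mapping described above: a single selector produces one step per matching column.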

Interrogation and Reporting

The validation plan is put into action when interrogate() is called. The workflow for performing a comprehensive validation is then: (1) Validate(), (2) adding validation steps, (3) interrogate(). After interrogation, we can view a validation report table (by printing the object or using get_tabular_report()), extract key metrics, or split the data based on the validation results (with get_sundered_data()).

Validate.interrogate Execute each validation step against the table and store the results.
Validate.get_tabular_report Validation report as a GT table.
Validate.get_step_report Get a detailed report for a single validation step.
Validate.get_json_report Get a report of the validation results as a JSON-formatted string.
Validate.get_sundered_data Get the data that passed or failed the validation steps.
Validate.get_data_extracts Get the rows that failed for each validation step.
Validate.all_passed Determine if every validation step passed perfectly, with no failing test units.
Validate.assert_passing Raise an AssertionError if not all tests are passing.
Validate.n Provides a dictionary of the number of test units for each validation step.
Validate.n_passed Provides a dictionary of the number of test units that passed for each validation step.
Validate.n_failed Provides a dictionary of the number of test units that failed for each validation step.
Validate.f_passed Provides a dictionary of the fraction of test units that passed for each validation step.
Validate.f_failed Provides a dictionary of the fraction of test units that failed for each validation step.
Validate.warning Get the ‘warning’ level status for each validation step.
Validate.error Get the ‘error’ level status for each validation step.
Validate.critical Get the ‘critical’ level status for each validation step.
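
The relationship between the per-step metrics can be sketched in plain Python, starting from pass/fail results for each test unit. This does not call the library. The step numbering from 1, the threshold values, and the rule that a level is reached when the failing fraction is at or above its threshold are all assumptions made for illustration.

```python
# Row labels and per-step test-unit results (True = passed), one entry per row.
row_ids = ["r0", "r1", "r2", "r3"]
unit_results = {
    1: [True, True, False, True],
    2: [True, False, False, True],
}

# Counts and fractions per step, keyed by step number.
n = {i: len(units) for i, units in unit_results.items()}
n_passed = {i: sum(units) for i, units in unit_results.items()}
n_failed = {i: n[i] - n_passed[i] for i in unit_results}
f_passed = {i: n_passed[i] / n[i] for i in unit_results}
f_failed = {i: n_failed[i] / n[i] for i in unit_results}

# Assumed thresholds, expressed as fractions of failing test units.
thresholds = {"warning": 0.25, "error": 0.5, "critical": 0.75}
warning = {i: f_failed[i] >= thresholds["warning"] for i in unit_results}
error = {i: f_failed[i] >= thresholds["error"] for i in unit_results}
critical = {i: f_failed[i] >= thresholds["critical"] for i in unit_results}

# True only when no step has a failing test unit.
all_passed = all(count == 0 for count in n_failed.values())

# Splitting the rows: a row "passes" only if it passed in every step.
row_ok = [all(units[k] for units in unit_results.values()) for k in range(len(row_ids))]
passed_rows = [r for r, ok in zip(row_ids, row_ok) if ok]
failed_rows = [r for r, ok in zip(row_ids, row_ok) if not ok]

print(f_failed, warning, error, critical)
print(all_passed, passed_rows, failed_rows)
```

In this sketch, step 1 fails one of four units and reaches only the assumed warning level, while step 2 fails two of four and also reaches the error level.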

Inspect

The Inspect group contains functions that are helpful for getting to grips with a new data table. Use the DataScan class to get a quick overview of the data, preview() to see the first and last few rows of a table, missing_vals_tbl() to see where there are missing values in a table, and get_column_count()/get_row_count() to get the number of columns and rows in a table. Several datasets included in the package can be accessed via the load_dataset() function. Finally, the config() utility lets us set global configuration parameters.

DataScan Get a summary of a dataset.
preview Display a table preview that shows some rows from the top, some from the bottom.
missing_vals_tbl Display a table that shows the missing values in the input table.
get_column_count Get the number of columns in a table.
get_row_count Get the number of rows in a table.
load_dataset Load a dataset hosted in the library as the specified table type.
config Configuration settings for the pointblank library.
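
As a rough illustration of what these inspection helpers report, the following plain-Python sketch computes the same kinds of summaries on a small list-of-dicts table. It does not call the library; the head_tail() function, the sample table, and the treatment of None as a missing value are assumptions.

```python
from typing import Any

table: list[dict[str, Any]] = [
    {"id": 1, "city": "Oslo", "temp": 4.5},
    {"id": 2, "city": None, "temp": 7.0},
    {"id": 3, "city": "Lima", "temp": None},
    {"id": 4, "city": "Pune", "temp": 31.2},
    {"id": 5, "city": "Kyiv", "temp": None},
]

# Row and column counts.
row_count = len(table)
column_count = len(table[0]) if table else 0


def head_tail(rows: list[dict[str, Any]], n_head: int = 2, n_tail: int = 2) -> list[dict[str, Any]]:
    """Return the first n_head and last n_tail rows, or everything if the table is short."""
    if len(rows) <= n_head + n_tail:
        return list(rows)
    return rows[:n_head] + rows[len(rows) - n_tail:]


# Missing values per column (None is treated as missing in this sketch).
missing = {col: sum(row[col] is None for row in table) for col in table[0]}

print(row_count, column_count)
print([row["id"] for row in head_tail(table)])
print(missing)
```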