API Reference
Validate
When performing data validation, you’ll need the Validate class to get the process started. It’s given the target table, and you can optionally provide some metadata and/or failure thresholds (using the Thresholds class or one of its shorthand forms). The Validate class has numerous methods for defining validation steps and for obtaining post-interrogation metrics and data.
Validate | Workflow for defining a set of validations on a table and interrogating for results. |
Thresholds | Definition of threshold values. |
Actions | Definition of action values. |
Schema | Definition of a schema object. |
DraftValidation | Draft a validation plan for a given table using an LLM. |
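To make the idea of failure thresholds concrete, here is a minimal sketch in plain Python. It does not use the library: ThresholdSketch and levels_reached are hypothetical names, and the rule that a level is reached once the failing fraction meets or exceeds its limit is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ThresholdSketch:
    """Hypothetical stand-in for a set of failure thresholds.

    Each level is a failing fraction in [0, 1], or None when unset.
    """

    warning: Optional[float] = None
    error: Optional[float] = None
    critical: Optional[float] = None

    def levels_reached(self, n_failed: int, n: int) -> dict:
        # Fraction of failing test units; guard against an empty table.
        f_failed = n_failed / n if n else 0.0
        limits = (
            ("warning", self.warning),
            ("error", self.error),
            ("critical", self.critical),
        )
        # Assumed rule: a level is reached when f_failed >= its limit.
        return {
            name: (limit is not None and f_failed >= limit)
            for name, limit in limits
        }


limits = ThresholdSketch(warning=0.1, error=0.25)
status = limits.levels_reached(n_failed=2, n=13)
print(status)
```

With 2 failing test units out of 13, the failing fraction is about 0.15, so only the warning level is reached in this sketch.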
Validation Steps
Validation steps can be thought of as sequential validations on the target data. We call the validation methods of Validate to build up a validation plan: a collection of steps that, in the aggregate, provides good validation coverage.
Validate.col_vals_gt | Are column data greater than a fixed value or data in another column? |
Validate.col_vals_lt | Are column data less than a fixed value or data in another column? |
Validate.col_vals_ge | Are column data greater than or equal to a fixed value or data in another column? |
Validate.col_vals_le | Are column data less than or equal to a fixed value or data in another column? |
Validate.col_vals_eq | Are column data equal to a fixed value or data in another column? |
Validate.col_vals_ne | Are column data not equal to a fixed value or data in another column? |
Validate.col_vals_between | Do column data lie between two specified values or data in other columns? |
Validate.col_vals_outside | Do column data lie outside of two specified values or data in other columns? |
Validate.col_vals_in_set | Validate whether column values are in a set of values. |
Validate.col_vals_not_in_set | Validate whether column values are not in a set of values. |
Validate.col_vals_null | Validate whether values in a column are NULL. |
Validate.col_vals_not_null | Validate whether values in a column are not NULL. |
Validate.col_vals_regex | Validate whether column values match a regular expression pattern. |
Validate.col_vals_expr | Validate column values using a custom expression. |
Validate.col_exists | Validate whether one or more columns exist in the table. |
Validate.rows_distinct | Validate whether rows in the table are distinct. |
Validate.col_schema_match | Do columns in the table (and their types) match a predefined schema? |
Validate.row_count_match | Validate whether the row count of the table matches a specified count. |
Validate.col_count_match | Validate whether the column count of the table matches a specified count. |
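The pattern of building up a plan by calling one method per step can be sketched in plain Python. This is an illustration of the chaining idea only, not the Validate class: PlanSketch and its gt, in_set, and not_null methods are hypothetical, and nothing is evaluated while the plan is being assembled.

```python
class PlanSketch:
    """Hypothetical miniature of a validation plan (not the Validate class)."""

    def __init__(self, rows):
        self.rows = rows
        self.steps = []  # ordered (label, per-row check) pairs

    def _add(self, label, check):
        self.steps.append((label, check))
        return self  # returning self is what makes chaining possible

    def gt(self, column, value):
        # Step: is the column value greater than a fixed value?
        return self._add(f"{column} > {value}", lambda r: r[column] > value)

    def in_set(self, column, allowed):
        # Step: is the column value a member of the allowed set?
        allowed = set(allowed)
        return self._add(f"{column} in set", lambda r: r[column] in allowed)

    def not_null(self, column):
        # Step: is the column value present (not None)?
        return self._add(f"{column} not null", lambda r: r[column] is not None)


rows = [
    {"a": 2, "c": 3, "f": "high"},
    {"a": 6, "c": None, "f": "low"},
]

# Each call appends one step; the checks have not been run yet.
plan = PlanSketch(rows).gt("a", 1).in_set("f", ["low", "mid", "high"]).not_null("c")
step_labels = [label for label, _ in plan.steps]
print(step_labels)
```

Keeping the plan as plain data until a separate execution phase is the design choice that lets steps be listed, reordered, or reported on before any data is touched.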
Column Selection
A flexible way to select columns for validation is to use the col() function along with column selection helper functions. Combining col() with starts_with(), matches(), etc., allows for the selection of multiple target columns (mapping a validation across many steps). Furthermore, the col() function can be used to declare a comparison column (e.g., for the value= argument in many col_vals_*() methods) when you can’t use a fixed value for comparison.
col | Helper function for referencing a column in the input table. |
starts_with | Select columns that start with specified text. |
ends_with | Select columns that end with specified text. |
contains | Select columns that contain specified text. |
matches | Select columns that match a specified regular expression pattern. |
everything | Select all columns. |
first_n | Select the first n columns in the column list. |
last_n | Select the last n columns in the column list. |
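As a rough illustration of how selection helpers can fan one validation out across several columns, the following plain-Python sketch resolves selectors against a list of column names. The sel_* functions are hypothetical stand-ins rather than the library helpers, and the column names are made up for the example. The last lines contrast a fixed comparison value with a comparison column.

```python
import re


def sel_starts_with(text):
    # Selector: keep names that begin with the given text.
    return lambda names: [c for c in names if c.startswith(text)]


def sel_matches(pattern):
    # Selector: keep names that match the regular expression.
    rx = re.compile(pattern)
    return lambda names: [c for c in names if rx.search(c)]


def sel_last_n(n):
    # Selector: keep the last n names (or all of them if n is larger).
    def pick(names):
        return names[max(len(names) - n, 0):] if n > 0 else []
    return pick


columns = ["date_time", "date", "a", "b", "c", "d"]

date_cols = sel_starts_with("date")(columns)
letter_cols = sel_matches(r"^[a-c]$")(columns)
tail_cols = sel_last_n(2)(columns)

# One validation mapped across every selected column becomes one step each.
expanded_steps = [f"{name} is not null" for name in date_cols]

# Fixed value versus comparison column for a "greater than" check.
rows = [{"a": 3, "b": 4}, {"a": 2, "b": 1}]
against_fixed = [r["a"] > 2 for r in rows]        # compare a to the value 2
against_column = [r["a"] > r["b"] for r in rows]  # compare a to column b

print(date_cols, letter_cols, tail_cols)
print(expanded_steps)
print(against_fixed, against_column)
```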
Interrogation and Reporting
The validation plan is put into action when interrogate() is called. The workflow for performing a comprehensive validation is then: (1) Validate(), (2) adding validation steps, (3) interrogate(). After interrogation of the data, we can view a validation report table (by printing the object or using get_tabular_report()), extract key metrics, or split the data based on the validation results (with get_sundered_data()).
Validate.interrogate | Execute each validation step against the table and store the results. |
Validate.get_tabular_report | Validation report as a GT table. |
Validate.get_step_report | Get a detailed report for a single validation step. |
Validate.get_json_report | Get a report of the validation results as a JSON-formatted string. |
Validate.get_sundered_data | Get the data that passed or failed the validation steps. |
Validate.get_data_extracts | Get the rows that failed for each validation step. |
Validate.all_passed | Determine if every validation step passed perfectly, with no failing test units. |
Validate.assert_passing | Raise an AssertionError unless all tests are passing. |
Validate.n | Provides a dictionary of the number of test units for each validation step. |
Validate.n_passed | Provides a dictionary of the number of test units that passed for each validation step. |
Validate.n_failed | Provides a dictionary of the number of test units that failed for each validation step. |
Validate.f_passed | Provides a dictionary of the fraction of test units that passed for each validation step. |
Validate.f_failed | Provides a dictionary of the fraction of test units that failed for each validation step. |
Validate.warning | Get the ‘warning’ level status for each validation step. |
Validate.error | Get the ‘error’ level status for each validation step. |
Validate.critical | Get the ‘critical’ level status for each validation step. |
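The per-step metrics and the splitting of data can be sketched in plain Python as follows. This is a conceptual illustration, not the library's implementation. It assumes that each row is one test unit, that the metric dictionaries are keyed by step number, and that a row counts as passing only if it passed every step.

```python
rows = [
    {"id": 1, "a": 4, "b": 2},
    {"id": 2, "a": 1, "b": 3},
    {"id": 3, "a": 7, "b": None},
]

# Two steps, keyed by step number (keying by number is an assumption).
steps = {
    1: lambda r: r["a"] > 2,
    2: lambda r: r["b"] is not None,
}

# "Interrogation": run every step against every row and keep the outcomes.
outcomes = {i: [check(r) for r in rows] for i, check in steps.items()}

# Per-step counts and fractions of test units.
n = {i: len(o) for i, o in outcomes.items()}
n_passed = {i: sum(o) for i, o in outcomes.items()}
n_failed = {i: n[i] - n_passed[i] for i in outcomes}
f_passed = {i: n_passed[i] / n[i] for i in outcomes}
f_failed = {i: n_failed[i] / n[i] for i in outcomes}

# True only when no step has a failing test unit.
all_passed = all(count == 0 for count in n_failed.values())

# Split the rows: passed every step versus failed at least one step.
row_ok = [all(o[k] for o in outcomes.values()) for k in range(len(rows))]
passing_rows = [r for r, ok in zip(rows, row_ok) if ok]
failing_rows = [r for r, ok in zip(rows, row_ok) if not ok]

print(n_passed, n_failed, all_passed)
print([r["id"] for r in passing_rows], [r["id"] for r in failing_rows])
```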
Inspect
The Inspect group contains functions that are helpful for getting to grips with a new data table. Use the DataScan class to get a quick overview of the data, preview() to see the first and last few rows of a table, missing_vals_tbl() to see where there are missing values in a table, and get_column_count()/get_row_count() to get the number of columns and rows in a table. Several datasets included in the package can be accessed via the load_dataset() function. Finally, the config() utility lets us set global configuration parameters.
DataScan | Get a summary of a dataset. |
preview | Display a table preview that shows some rows from the top, some from the bottom. |
missing_vals_tbl | Display a table that shows the missing values in the input table. |
get_column_count | Get the number of columns in a table. |
get_row_count | Get the number of rows in a table. |
load_dataset | Load a dataset hosted in the library as specified table type. |
config | Configuration settings for the pointblank library. |
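The kind of first look that these inspection functions provide can be approximated in plain Python. The sketch below is illustrative only: preview_rows and missing_counts are hypothetical helpers written for this example, the table is a list of dictionaries, and None stands in for a missing value.

```python
def preview_rows(rows, n_head=2, n_tail=2):
    """Return the first n_head and last n_tail rows of a table."""
    if len(rows) <= n_head + n_tail:
        return list(rows)  # small table: show everything once
    return rows[:n_head] + rows[len(rows) - n_tail:]


def missing_counts(rows):
    """Count missing (None) values per column."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            counts[name] = counts.get(name, 0) + (value is None)
    return counts


# Seven rows; y is missing wherever x is a multiple of 3.
rows = [{"x": i, "y": None if i % 3 == 0 else i * 10} for i in range(1, 8)]

shown = preview_rows(rows)
missing = missing_counts(rows)
row_count, column_count = len(rows), len(rows[0])

print([r["x"] for r in shown])
print(missing)
print(row_count, column_count)
```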