---------------------------------------------------------------------- This is the API documentation for the Pointblank library. ---------------------------------------------------------------------- ## The Validate family When performing data validation, you'll need the `Validate` class to get the process started. It's given the target table and you can optionally provide some metadata and/or failure thresholds (using the `Thresholds` class or through shorthands for this task). The `Validate` class has numerous methods for defining validation steps and for obtaining post-interrogation metrics and data. Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds | None' = None, actions: 'Actions | None' = None, final_actions: 'FinalActions | None' = None, brief: 'str | bool | None' = None, lang: 'str | None' = None, locale: 'str | None' = None) -> None Workflow for defining a set of validations on a table and interrogating for results. The `Validate` class is used for defining a set of validation steps on a table and interrogating the table with the *validation plan*. This class is the main entry point for the *data quality reporting* workflow. The overall aim of this workflow is to generate comprehensive reporting information to assess the level of data quality for a target table. We can supply as many validation steps as needed, and having a large number of them should increase the validation coverage for a given table. The validation methods (e.g., [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`), [`col_vals_between()`](`pointblank.Validate.col_vals_between`), etc.) translate to discrete validation steps, where each step will be sequentially numbered (useful when viewing the reporting data). This process of calling validation methods is known as developing a *validation plan*. The validation methods, when called, are merely instructions up to the point the concluding [`interrogate()`](`pointblank.Validate.interrogate`) method is called. That kicks off the process of acting on the *validation plan* by querying the target table and getting reporting results for each step. Once the interrogation process is complete, we can say that the workflow now has reporting information. We can then extract useful information from the reporting data to understand the quality of the table. Printing the `Validate` object (or using the [`get_tabular_report()`](`pointblank.Validate.get_tabular_report`) method) will return a table with the results of the interrogation, and [`get_sundered_data()`](`pointblank.Validate.get_sundered_data`) allows for the splitting of the table based on passing and failing rows. Parameters ---------- data The table to validate, which could be a DataFrame object, an Ibis table object, a CSV file path, a Parquet file path, a GitHub URL pointing to a CSV or Parquet file, or a database connection string. When providing a CSV or Parquet file path (as a string or `pathlib.Path` object), the file will be automatically loaded using an available DataFrame library (Polars or Pandas). Parquet input also supports glob patterns, directories containing `.parquet` files, and Spark-style partitioned datasets. GitHub URLs are automatically transformed to raw content URLs and downloaded. Connection strings enable direct database access via Ibis with optional table specification using the `::table_name` suffix. Read the *Supported Input Table Types* section for details on the supported table types.
tbl_name An optional name to assign to the input table object. If no value is provided, a name will be generated based on whatever information is available. This table name will be displayed in the header area of the tabular report. label An optional label for the validation plan. If no value is provided, a label will be generated based on the current system date and time. Markdown can be used here to make the label more visually appealing (it will appear in the header area of the tabular report). thresholds Generate threshold failure levels so that all validation steps can report and react accordingly when exceeding the set levels. The thresholds are set at the global level and can be overridden at the validation step level (each validation step has its own `thresholds=` parameter). The default is `None`, which means that no thresholds will be set. Look at the *Thresholds* section for information on how to set threshold levels. actions The actions to take when validation steps meet or exceed any set threshold levels. These actions are paired with the threshold levels and are executed during the interrogation process when there are exceedances. The actions are executed right after each step is evaluated. Such actions should be provided in the form of an `Actions` object. If `None` then no global actions will be set. View the *Actions* section for information on how to set actions. final_actions The actions to take when the validation process is complete and the final results are available. This is useful for sending notifications or reporting the overall status of the validation process. The final actions are executed after all validation steps have been processed and the results have been collected. The final actions are not tied to any threshold levels; they are executed regardless of the validation results. Such actions should be provided in the form of a `FinalActions` object. If `None` then no finalizing actions will be set. Please see the *Actions* section for information on how to set final actions. brief A global setting for briefs, which are optional brief descriptions for validation steps (they can be displayed in the reporting table). For such a global setting, templating elements like `"{step}"` (to insert the step number) or `"{auto}"` (to include an automatically generated brief) are useful. If `True` then each brief will be automatically generated. If `None` (the default) then briefs aren't globally set. lang The language to use for various reporting elements. By default, `None` will select English (`"en"`) as the language, but other options include French (`"fr"`), German (`"de"`), Italian (`"it"`), Spanish (`"es"`), and several more. Have a look at the *Reporting Languages* section for the full list of supported languages and information on how the language setting is utilized. locale An optional locale ID to use for formatting values in the reporting table according to the locale's rules. Examples include `"en-US"` for English (United States) and `"fr-FR"` for French (France). More simply, this can be a language identifier without a designation of territory, like `"es"` for Spanish. Returns ------- Validate A `Validate` object with the table and validations to be performed.
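Before getting into the details below, here is a minimal sketch of how these constructor arguments can fit together. The dataset, the `date` column, and the specific threshold values are used only for illustration:

```python
import pointblank as pb

# A minimal sketch of the constructor parameters described above: shorthand
# thresholds (a warning/error/critical tuple), auto-generated briefs, and
# French report text with French-style value formatting.
validation = (
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        tbl_name="small_table",
        label="**Nightly** data checks",   # Markdown is allowed in the label
        thresholds=(0.1, 0.2, 0.3),        # warning, error, critical
        brief=True,                        # auto-generate a brief for each step
        lang="fr",                         # language for reporting elements
        locale="fr-FR",                    # formatting rules for values
    )
    .col_exists(columns="date")
    .interrogate()
)
```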
Supported Input Table Types --------------------------- The `data=` parameter can be given any of the following table types: - Polars DataFrame (`"polars"`) - Pandas DataFrame (`"pandas"`) - PySpark table (`"pyspark"`) - DuckDB table (`"duckdb"`)* - MySQL table (`"mysql"`)* - PostgreSQL table (`"postgresql"`)* - SQLite table (`"sqlite"`)* - Microsoft SQL Server table (`"mssql"`)* - Snowflake table (`"snowflake"`)* - Databricks table (`"databricks"`)* - BigQuery table (`"bigquery"`)* - Parquet table (`"parquet"`)* - CSV files (string path or `pathlib.Path` object with `.csv` extension) - Parquet files (string path, `pathlib.Path` object, glob pattern, directory with `.parquet` extension, or partitioned dataset) - Database connection strings (URI format with optional table specification) The table types marked with an asterisk need to be prepared as Ibis tables (with type of `ibis.expr.types.relations.Table`). Furthermore, the use of `Validate` with such tables requires the Ibis library v9.5.0 and above to be installed. If the input table is a Polars or Pandas DataFrame, the Ibis library is not required. To use a CSV file, ensure that a string or `pathlib.Path` object with a `.csv` extension is provided. The file will be automatically detected and loaded using the best available DataFrame library. The loading preference is Polars first, then Pandas as a fallback. Connection strings follow database URL formats and must also specify a table using the `::table_name` suffix. Examples include: ``` "duckdb:///path/to/database.ddb::table_name" "sqlite:///path/to/database.db::table_name" "postgresql://user:password@localhost:5432/database::table_name" "mysql://user:password@localhost:3306/database::table_name" "bigquery://project/dataset::table_name" "snowflake://user:password@account/database/schema::table_name" ``` When using connection strings, the Ibis library with the appropriate backend driver is required. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for all validation steps. They are set here at the global level but can be overridden at the validation step level (each validation step has its own local `thresholds=` parameter). There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion failing of all test units (a value between `0` and `1`) or as the absolute number of failing test units (as an integer that's `1` or greater). Thresholds can be defined using one of these input schemes: 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds) 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level 3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical' 4. a single integer/float value denoting absolute number or fraction of failing test units for the 'warning' level only If the number of failing test units for a validation step exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all of the threshold levels need to be set; you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). Actions ------- The `actions=` and `final_actions=` parameters provide mechanisms to respond to validation results.
These actions can be used to notify users of validation failures, log issues, or trigger other processes when problems are detected. *Step Actions* The `actions=` parameter allows you to define actions that are triggered when validation steps exceed specific threshold levels (warning, error, or critical). These actions are executed during the interrogation process, right after each step is evaluated. Step actions should be provided using the [`Actions`](`pointblank.Actions`) class, which lets you specify different actions for different severity levels: ```python # Define an action that logs a message when warning threshold is exceeded def log_warning(): metadata = pb.get_action_metadata() print(f"WARNING: Step {metadata['step']} failed with type {metadata['type']}") # Define actions for different threshold levels actions = pb.Actions( warning = log_warning, error = lambda: send_email("Error in validation"), critical = "CRITICAL FAILURE DETECTED" ) # Use in Validate validation = pb.Validate( data=my_data, actions=actions # Global actions for all steps ) ``` You can also provide step-specific actions in individual validation methods: ```python validation.col_vals_gt( columns="revenue", value=0, actions=pb.Actions(warning=log_warning) # Only applies to this step ) ``` Step actions have access to step-specific context through the [`get_action_metadata()`](`pointblank.get_action_metadata`) function, which provides details about the current validation step that triggered the action. *Final Actions* The `final_actions=` parameter lets you define actions that execute after all validation steps have completed. These are useful for providing summaries, sending notifications based on overall validation status, or performing cleanup operations. Final actions should be provided using the [`FinalActions`](`pointblank.FinalActions`) class: ```python def send_report(): summary = pb.get_validation_summary() if summary["status"] == "CRITICAL": send_alert_email( subject=f"CRITICAL validation failures in {summary['tbl_name']}", body=f"{summary['critical_steps']} steps failed with critical severity." ) validation = pb.Validate( data=my_data, final_actions=pb.FinalActions(send_report) ) ``` Final actions have access to validation-wide summary information through the [`get_validation_summary()`](`pointblank.get_validation_summary`) function, which provides a comprehensive overview of the entire validation process. The combination of step actions and final actions provides a flexible system for responding to data quality issues at both the individual step level and the overall validation level. Reporting Languages ------------------- Various pieces of reporting in Pointblank can be localized to a specific language. This is done by setting the `lang=` parameter in `Validate`. 
Any of the following languages can be used (just provide the language code): - English (`"en"`) - French (`"fr"`) - German (`"de"`) - Italian (`"it"`) - Spanish (`"es"`) - Portuguese (`"pt"`) - Dutch (`"nl"`) - Swedish (`"sv"`) - Danish (`"da"`) - Norwegian Bokmål (`"nb"`) - Icelandic (`"is"`) - Finnish (`"fi"`) - Polish (`"pl"`) - Czech (`"cs"`) - Romanian (`"ro"`) - Greek (`"el"`) - Russian (`"ru"`) - Turkish (`"tr"`) - Arabic (`"ar"`) - Hindi (`"hi"`) - Simplified Chinese (`"zh-Hans"`) - Traditional Chinese (`"zh-Hant"`) - Japanese (`"ja"`) - Korean (`"ko"`) - Vietnamese (`"vi"`) - Indonesian (`"id"`) - Ukrainian (`"uk"`) - Hebrew (`"he"`) - Thai (`"th"`) - Persian (`"fa"`) Automatically generated briefs (produced by using `brief=True` or `brief="...{auto}..."`) will be written in the selected language. The language setting will also be used when generating the validation report table through [`get_tabular_report()`](`pointblank.Validate.get_tabular_report`) (or printing the `Validate` object in a notebook environment). Examples -------- ### Creating a validation plan and interrogating Let's walk through a data quality analysis of an extremely small table. It's actually called `"small_table"` and it's accessible through the [`load_dataset()`](`pointblank.load_dataset`) function. ```python import pointblank as pb # Load the `small_table` dataset small_table = pb.load_dataset(dataset="small_table", tbl_type="polars") # Preview the table pb.preview(small_table) ``` We ought to think about what's tolerable in terms of data quality, so let's designate proportional failure thresholds to the 'warning', 'error', and 'critical' states. This can be done by using the [`Thresholds`](`pointblank.Thresholds`) class. ```python thresholds = pb.Thresholds(warning=0.10, error=0.25, critical=0.35) ``` Now, we use the `Validate` class and give it the `thresholds` object (which serves as a default for all validation steps but can be overridden). The static thresholds provided in `thresholds=` will make the reporting a bit more useful. We also need to provide a target table and we'll use `small_table` for this. ```python validation = ( pb.Validate( data=small_table, tbl_name="small_table", label="`Validate` example.", thresholds=thresholds ) ) ``` Then, as with any `Validate` object, we can add steps to the validation plan by using as many validation methods as we want. To conclude the process (and actually query the data table), we use the [`interrogate()`](`pointblank.Validate.interrogate`) method. ```python validation = ( validation .col_vals_gt(columns="d", value=100) .col_vals_le(columns="c", value=5) .col_vals_between(columns="c", left=3, right=10, na_pass=True) .col_vals_regex(columns="b", pattern=r"[0-9]-[a-z]{3}-[0-9]{3}") .col_exists(columns=["date", "date_time"]) .interrogate() ) ``` The `validation` object can be printed as a reporting table. ```python validation ``` The report could be further customized by using the [`get_tabular_report()`](`pointblank.Validate.get_tabular_report`) method, which contains options for modifying the display of the table. ### Adding briefs Briefs are short descriptions of the validation steps. While they can be set for each step individually, they can also be set globally. The global setting is done by using the `brief=` argument in `Validate`. The global setting can be as simple as `True` to have automatically-generated briefs for each step.
Alternatively, we can use templating elements like `"{step}"` (to insert the step number) or `"{auto}"` (to include an automatically generated brief). Here's an example of a global setting for briefs: ```python validation_2 = ( pb.Validate( data=pb.load_dataset(), tbl_name="small_table", label="Validation example with briefs", brief="Step {step}: {auto}", ) .col_vals_gt(columns="d", value=100) .col_vals_between(columns="c", left=3, right=10, na_pass=True) .col_vals_regex( columns="b", pattern=r"[0-9]-[a-z]{3}-[0-9]{3}", brief="Regex check for column {col}" ) .interrogate() ) validation_2 ``` We see the text of the briefs appear in the `STEP` column of the reporting table. Furthermore, the global brief's template (`"Step {step}: {auto}"`) is applied to all steps except for the final step, where the step-level `brief=` argument provided an override. If you should want to cancel the globally-defined brief for one or more validation steps, you can set `brief=False` in those particular steps. ### Post-interrogation methods The `Validate` class has a number of post-interrogation methods that can be used to extract useful information from the validation results. For example, the [`get_data_extracts()`](`pointblank.Validate.get_data_extracts`) method can be used to get the data extracts for each validation step. ```python validation_2.get_data_extracts() ``` We can also view step reports for each validation step using the [`get_step_report()`](`pointblank.Validate.get_step_report`) method. This method adapts to the type of validation step and shows the relevant information for a step's validation. ```python validation_2.get_step_report(i=2) ``` The `Validate` class also has a method for getting the sundered data, which is the data that passed or failed the validation steps. This can be done using the [`get_sundered_data()`](`pointblank.Validate.get_sundered_data`) method. ```python pb.preview(validation_2.get_sundered_data()) ``` The sundered data is a DataFrame that contains the rows that passed or failed the validation. The default behavior is to return the rows that failed the validation, as shown above. ### Working with CSV Files The `Validate` class can directly accept CSV file paths, making it easy to validate data stored in CSV files without manual loading: ```python # Get a path to a CSV file from the package data csv_path = pb.get_data_path("global_sales", "csv") validation_3 = ( pb.Validate( data=csv_path, label="CSV validation example" ) .col_exists(["customer_id", "product_id", "revenue"]) .col_vals_not_null(["customer_id", "product_id"]) .col_vals_gt(columns="revenue", value=0) .interrogate() ) validation_3 ``` You can also use a Path object to specify the CSV file. Here's an example of how to do that: ```python from pathlib import Path csv_file = Path(pb.get_data_path("game_revenue", "csv")) validation_4 = ( pb.Validate(data=csv_file, label="Game Revenue Validation") .col_exists(["player_id", "session_id", "item_name"]) .col_vals_regex( columns="session_id", pattern=r"[A-Z0-9]{8}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{12}" ) .col_vals_gt(columns="item_revenue", value=0, na_pass=True) .interrogate() ) validation_4 ``` The CSV loading is automatic, so when a string or Path with a `.csv` extension is provided, Pointblank will automatically load the file using the best available DataFrame library (Polars preferred, Pandas as fallback). The loaded data can then be used with all validation methods just like any other supported table type. 
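For orientation, the Polars-then-Pandas preference described above can be pictured with a rough sketch like the following. This is only an illustration of the documented fallback order, not Pointblank's actual internal code, and the file path is hypothetical:

```python
from pathlib import Path

csv_file = Path("data.csv")  # hypothetical CSV path

# Rough illustration of the documented loading preference: Polars first,
# then Pandas as a fallback (Pointblank does this for you automatically).
try:
    import polars as pl
    df = pl.read_csv(csv_file)
except ImportError:
    import pandas as pd
    df = pd.read_csv(csv_file)
```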
### Working with Parquet Files The `Validate` class can directly accept Parquet files and datasets in various formats. The following examples illustrate how to validate Parquet files: ```python # Single Parquet file from package data parquet_path = pb.get_data_path("nycflights", "parquet") validation_5 = ( pb.Validate( data=parquet_path, tbl_name="NYC Flights Data" ) .col_vals_not_null(["carrier", "origin", "dest"]) .col_vals_gt(columns="distance", value=0) .interrogate() ) validation_5 ``` You can also use glob patterns and directories. Here are some examples for how to: 1. load multiple Parquet files 2. load a Parquet-containing directory 3. load a partitioned Parquet dataset ```python # Multiple Parquet files with glob patterns validation_6 = pb.Validate(data="data/sales_*.parquet") # Directory containing Parquet files validation_7 = pb.Validate(data="parquet_data/") # Partitioned Parquet dataset validation_8 = ( pb.Validate(data="sales_data/") # Contains year=2023/quarter=Q1/region=US/sales.parquet .col_exists(["transaction_id", "amount", "year", "quarter", "region"]) .interrogate() ) ``` When you point to a directory that contains a partitioned Parquet dataset (with subdirectories like `year=2023/quarter=Q1/region=US/`), Pointblank will automatically: - discover all Parquet files recursively - extract partition column values from directory paths - add partition columns to the final DataFrame - combine all partitions into a single table for validation Both Polars and Pandas handle partitioned datasets natively, so this works seamlessly with either DataFrame library. The loading preference is Polars first, then Pandas as a fallback. ### Working with Database Connection Strings The `Validate` class supports database connection strings for direct validation of database tables. Connection strings must specify a table using the `::table_name` suffix: ```python # Get path to a DuckDB database file from package data duckdb_path = pb.get_data_path("game_revenue", "duckdb") validation_9 = ( pb.Validate( data=f"duckdb:///{duckdb_path}::game_revenue", label="DuckDB Game Revenue Validation" ) .col_exists(["player_id", "session_id", "item_revenue"]) .col_vals_gt(columns="item_revenue", value=0) .interrogate() ) validation_9 ``` For comprehensive documentation on supported connection string formats, error handling, and installation requirements, see the [`connect_to_table()`](`pointblank.connect_to_table`) function. This function handles all the connection logic and provides helpful error messages when table specifications are missing or backend dependencies are not installed. Thresholds(warning: 'int | float | bool | None' = None, error: 'int | float | bool | None' = None, critical: 'int | float | bool | None' = None) -> None Definition of threshold values. Thresholds are used to set limits on the number of failing test units at different levels. The levels are 'warning', 'error', and 'critical'. These levels correspond to different levels of severity when a threshold is reached. The threshold values can be set as absolute counts or as fractions of the total number of test units. When a threshold is reached, an action can be taken (e.g., displaying a message or calling a function) if there is an associated action defined for that level (defined through the [`Actions`](`pointblank.Actions`) class). Parameters ---------- warning The threshold for the 'warning' level. This can be an absolute count or a fraction of the total. Using `True` will set this threshold value to `1`. 
error The threshold for the 'error' level. This can be an absolute count or a fraction of the total. Using `True` will set this threshold value to `1`. critical The threshold for the 'critical' level. This can be an absolute count or a fraction of the total. Using `True` will set this threshold value to `1`. Returns ------- Thresholds A `Thresholds` object. This can be used when using the [`Validate`](`pointblank.Validate`) class (to set thresholds globally) or when defining validation steps like [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) (so that threshold values are scoped to individual validation steps, overriding any global thresholds). Examples -------- In a data validation workflow, you can set thresholds for the number of failing test units at different levels. For example, you can set a threshold for the 'warning' level when the number of failing test units exceeds 10% of the total number of test units: ```python thresholds_1 = pb.Thresholds(warning=0.1) ``` You can also set thresholds for the 'error' and 'critical' levels: ```python thresholds_2 = pb.Thresholds(warning=0.1, error=0.2, critical=0.05) ``` Thresholds can also be set as absolute counts. Here's an example where the 'warning' level is set to `5` failing test units: ```python thresholds_3 = pb.Thresholds(warning=5) ``` The `thresholds` object can be used to set global thresholds for all validation steps. Or, you can set thresholds for individual validation steps, which will override the global thresholds. Here's a data validation workflow example where we set global thresholds and then override with different thresholds at the [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) step: ```python validation = ( pb.Validate( data=pb.load_dataset(dataset="small_table"), label="Example Validation", thresholds=pb.Thresholds(warning=0.1, error=0.2, critical=0.3) ) .col_vals_not_null(columns=["c", "d"]) .col_vals_gt(columns="a", value=3, thresholds=pb.Thresholds(warning=5)) .interrogate() ) validation ``` As can be seen, the last step ([`col_vals_gt()`](`pointblank.Validate.col_vals_gt`)) has its own thresholds, which override the global thresholds set at the beginning of the validation workflow (in the [`Validate`](`pointblank.Validate`) class). Actions(warning: 'str | Callable | list[str | Callable] | None' = None, error: 'str | Callable | list[str | Callable] | None' = None, critical: 'str | Callable | list[str | Callable] | None' = None, default: 'str | Callable | list[str | Callable] | None' = None, highest_only: 'bool' = True) -> None Definition of action values. Actions complement threshold values by defining what action should be taken when a threshold level is reached. The action can be a string or a `Callable`. When a string is used, it is interpreted as a message to be displayed. When a `Callable` is used, it will be invoked at interrogation time if the threshold level is met or exceeded. There are three threshold levels: 'warning', 'error', and 'critical'. These levels correspond to different levels of severity when a threshold is reached. Those thresholds can be defined using the [`Thresholds`](`pointblank.Thresholds`) class or various shorthand forms. Actions don't have to be defined for all threshold levels; if an action is not defined for a level in exceedance, no action will be taken. 
Likewise, there is no negative consequence (other than a no-op) for defining actions for thresholds that don't exist (e.g., setting an action for the 'critical' level when no corresponding 'critical' threshold has been set). Parameters ---------- warning A string, `Callable`, or list of `Callable`/string values for the 'warning' level. Using `None` means no action should be performed at the 'warning' level. error A string, `Callable`, or list of `Callable`/string values for the 'error' level. Using `None` means no action should be performed at the 'error' level. critical A string, `Callable`, or list of `Callable`/string values for the 'critical' level. Using `None` means no action should be performed at the 'critical' level. default A string, `Callable`, or list of `Callable`/string values for all threshold levels. This parameter can be used to set the same action for all threshold levels. If an action is defined for a specific threshold level, it will override the action set for all levels. highest_only A boolean value that, when set to `True` (the default), results in executing only the action for the highest threshold level that is exceeded. Useful when you want to ensure that only the most severe action is taken when multiple threshold levels are exceeded. Returns ------- Actions An `Actions` object. This can be used when using the [`Validate`](`pointblank.Validate`) class (to set actions for meeting different threshold levels globally) or when defining validation steps like [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) (so that actions are scoped to individual validation steps, overriding any globally set actions). Types of Actions ---------------- Actions can be defined in different ways: 1. **String**: A message to be displayed when the threshold level is met or exceeded. 2. **Callable**: A function that is called when the threshold level is met or exceeded. 3. **List of Strings/Callables**: Multiple messages or functions to be called when the threshold level is met or exceeded. The actions are executed at interrogation time when the threshold level assigned to the action is exceeded by the number or proportion of failing test units. When providing a string, it will simply be printed to the console. A callable will also be executed at the time of interrogation. If providing a list of strings or callables, each item in the list will be executed in order. Such a list can contain a mix of strings and callables. String Templating ----------------- When using a string as an action, you can include placeholders for the following variables: - `{type}`: The validation step type where the action is executed (e.g., 'col_vals_gt', 'col_vals_lt', etc.) - `{level}`: The threshold level where the action is executed ('warning', 'error', or 'critical') - `{step}` or `{i}`: The step number in the validation workflow where the action is executed - `{col}` or `{column}`: The column name where the action is executed - `{val}` or `{value}`: An associated value for the validation method (e.g., the value to compare against in a 'col_vals_gt' validation step) - `{time}`: A datetime value for when the action was executed The first two placeholders can also be used in uppercase (e.g., `{TYPE}` or `{LEVEL}`) and the corresponding values will be displayed in uppercase. The placeholders are replaced with the actual values during interrogation. 
For example, the string `"{LEVEL}: '{type}' threshold exceeded for column {col}."` will be displayed as `"WARNING: 'col_vals_gt' threshold exceeded for column a."` when the 'warning' threshold is exceeded in a 'col_vals_gt' validation step involving column `a`. Crafting Callables with `get_action_metadata()` ----------------------------------------------- When creating a callable function to be used as an action, you can use the [`get_action_metadata()`](`pointblank.get_action_metadata`) function to retrieve metadata about the step where the action is executed. This metadata contains information about the validation step, including the step type, level, step number, column name, and associated value. You can use this information to craft your action message or to take specific actions based on the metadata provided. Examples -------- Let's define both threshold values and actions for a data validation workflow. We'll set these thresholds and actions globally for all validation steps. In this specific example, the only actions we'll define are for the 'critical' level: ```python import pointblank as pb validation = ( pb.Validate( data=pb.load_dataset(dataset="game_revenue", tbl_type="duckdb"), thresholds=pb.Thresholds(warning=0.05, error=0.10, critical=0.15), actions=pb.Actions(critical="Major data quality issue found in step {step}."), ) .col_vals_regex(columns="player_id", pattern=r"[A-Z]{12}[0-9]{3}") .col_vals_gt(columns="item_revenue", value=0.05) .col_vals_gt(columns="session_duration", value=15) .interrogate() ) validation ``` Because we set the 'critical' action to display `"Major data quality issue found."` in the console, this message will be displayed if the number of failing test units exceeds the 'critical' threshold (set to 15% of the total number of test units). In step 3 of the validation workflow, the 'critical' threshold is exceeded, so the message is displayed in the console. Actions can be defined locally for individual validation steps, which will override any global actions set at the beginning of the validation workflow. Here's a variation of the above example where we set global threshold values but assign an action only for an individual validation step: ```python def dq_issue(): from datetime import datetime print(f"Data quality issue found ({datetime.now()}).") validation = ( pb.Validate( data=pb.load_dataset(dataset="game_revenue", tbl_type="duckdb"), thresholds=pb.Thresholds(warning=0.05, error=0.10, critical=0.15), ) .col_vals_regex(columns="player_id", pattern=r"[A-Z]{12}[0-9]{3}") .col_vals_gt(columns="item_revenue", value=0.05) .col_vals_gt( columns="session_duration", value=15, actions=pb.Actions(warning=dq_issue), ) .interrogate() ) validation ``` In this case, the 'warning' action is set to call the `dq_issue()` function. This action is only executed when the 'warning' threshold is exceeded in the 'session_duration' column. Because all three thresholds are exceeded in step 3, the 'warning' action of executing the function occurs (resulting in a message being printed to the console). If actions were set for the other two threshold levels, they would also be executed. See Also -------- The [`get_action_metadata()`](`pointblank.get_action_metadata`) function, which can be used to retrieve metadata about the step where the action is executed. FinalActions(*args) Define actions to be taken after validation is complete. Final actions are executed after all validation steps have been completed. 
They provide a mechanism to respond to the overall validation results, such as sending alerts when critical failures are detected or generating summary reports. Parameters ---------- *actions One or more actions to execute after validation. An action can be (1) a callable function that will be executed with no arguments, or (2) a string message that will be printed to the console. Returns ------- FinalActions A `FinalActions` object. This can be used when using the [`Validate`](`pointblank.Validate`) class (to set final actions for the validation workflow). Types of Actions ---------------- Final actions can be defined in two different ways: 1. **String**: A message to be displayed when the validation is complete. 2. **Callable**: A function that is called when the validation is complete. The actions are executed at the end of the validation workflow. When providing a string, it will simply be printed to the console. A callable will also be executed at the time of validation completion. Several strings and callables can be provided to the `FinalActions` class, and they will be executed in the order they are provided. Crafting Callables with `get_validation_summary()` ------------------------------------------------- When creating a callable function to be used as a final action, you can use the [`get_validation_summary()`](`pointblank.get_validation_summary`) function to retrieve the summary of the validation results. This summary contains information about the validation workflow, including the number of test units, the number of failing test units, and the threshold levels that were exceeded. You can use this information to craft your final action message or to take specific actions based on the validation results. Examples -------- Final actions provide a powerful way to respond to the overall results of a validation workflow. They're especially useful for sending notifications, generating reports, or taking corrective actions based on the complete validation outcome. The following example shows how to create a final action that checks for critical failures and sends an alert: ```python import pointblank as pb def send_alert(): summary = pb.get_validation_summary() if summary["highest_severity"] == "critical": print(f"ALERT: Critical validation failures found in {summary['tbl_name']}") validation = ( pb.Validate( data=my_data, final_actions=pb.FinalActions(send_alert) ) .col_vals_gt(columns="revenue", value=0) .interrogate() ) ``` In this example, the `send_alert()` function is defined to check the validation summary for critical failures. If any are found, an alert message is printed to the console. The function is passed to the `FinalActions` class, which ensures it will be executed after all validation steps are complete. Note that we used the [`get_validation_summary()`](`pointblank.get_validation_summary`) function to retrieve the summary of the validation results to help craft the alert message. Multiple final actions can be provided in a sequence. They will be executed in the order they are specified after all validation steps have completed: ```python validation = ( pb.Validate( data=my_data, final_actions=pb.FinalActions( "Validation complete.", # a string message send_alert, # a callable function generate_report # another callable function ) ) .col_vals_gt(columns="revenue", value=0) .interrogate() ) ``` See Also -------- The [`get_validation_summary()`](`pointblank.get_validation_summary`) function, which can be used to retrieve the summary of the validation results.
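The `generate_report` callable in the example above isn't defined in this documentation. A minimal sketch of such a final action, using only summary fields that appear in the examples here (`tbl_name` and `highest_severity`), might look like this:

```python
import pointblank as pb

def generate_report():
    # A sketch of a final action: print a one-line summary once all
    # validation steps have completed.
    summary = pb.get_validation_summary()
    print(
        f"Validation of {summary['tbl_name']} finished; "
        f"highest severity reached: {summary['highest_severity']}"
    )

validation = (
    pb.Validate(
        data=pb.load_dataset(dataset="small_table"),
        final_actions=pb.FinalActions("Validation complete.", generate_report),
    )
    .col_vals_not_null(columns=["c", "d"])
    .interrogate()
)
```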
Schema(columns: 'str | list[str] | list[tuple[str, str]] | list[tuple[str]] | dict[str, str] | None' = None, tbl: 'any | None' = None, **kwargs) Definition of a schema object. The schema object defines the structure of a table. Once it is defined, the object can be used in a validation workflow, using `Validate` and its methods, to ensure that the structure of a table matches the expected schema. The validation method that works with the schema object is called [`col_schema_match()`](`pointblank.Validate.col_schema_match`). A schema for a table can be constructed with the `Schema` class in a number of ways: 1. providing a list of column names to `columns=` (to check only the column names) 2. using a list of one- or two-element tuples in `columns=` (to check both column names and optionally dtypes, should be in the form of `[(column_name, dtype), ...]`) 3. providing a dictionary to `columns=`, where the keys are column names and the values are dtypes 4. providing individual column arguments in the form of keyword arguments (constructed as `column_name=dtype`) The schema object can also be constructed by providing a DataFrame or Ibis table object (using the `tbl=` parameter) and the schema will be collected from either type of object. The schema object can be printed to display the column names and dtypes. Note that if `tbl=` is provided then there shouldn't be any other inputs provided through either `columns=` or `**kwargs`. Parameters ---------- columns A list of strings (representing column names), a list of tuples (for column names and column dtypes), or a dictionary containing column and dtype information. If any of these inputs are provided here, it will take precedence over any column arguments provided via `**kwargs`. tbl A DataFrame (Polars or Pandas) or an Ibis table object from which the schema will be collected. Read the *Supported Input Table Types* section for details on the supported table types. **kwargs Individual column arguments that are in the form of `column=dtype` or `column=[dtype1, dtype2, ...]`. These will be ignored if the `columns=` parameter is not `None`. Returns ------- Schema A schema object. Supported Input Table Types --------------------------- The `tbl=` parameter, if used, can be given any of the following table types: - Polars DataFrame (`"polars"`) - Pandas DataFrame (`"pandas"`) - PySpark table (`"pyspark"`) - DuckDB table (`"duckdb"`)* - MySQL table (`"mysql"`)* - PostgreSQL table (`"postgresql"`)* - SQLite table (`"sqlite"`)* - Microsoft SQL Server table (`"mssql"`)* - Snowflake table (`"snowflake"`)* - Databricks table (`"databricks"`)* - BigQuery table (`"bigquery"`)* - Parquet table (`"parquet"`)* The table types marked with an asterisk need to be prepared as Ibis tables (with type of `ibis.expr.types.relations.Table`). Furthermore, using `Schema(tbl=)` with these types of tables requires the Ibis library (`v9.5.0` or above) to be installed. If the input table is a Polars or Pandas DataFrame, the availability of Ibis is not needed. Additional Notes on Schema Construction --------------------------------------- While there is flexibility in how a schema can be constructed, there is the potential for some confusion. So let's go through each of the methods of constructing a schema in more detail and single out some important points. When providing a list of column names to `columns=`, a [`col_schema_match()`](`pointblank.Validate.col_schema_match`) validation step will only check the column names. Any arguments pertaining to dtypes will be ignored. 
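For instance, a names-only schema could be as simple as the following sketch (the column names are placeholders); a fuller walk-through of this form appears in the Examples below:

```python
import pointblank as pb

# Only the column names are checked; no dtype information is supplied.
schema_names_only = pb.Schema(columns=["name", "age", "height"])
```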
When using a list of tuples in `columns=`, the tuples could contain the column name and dtype or just the column name. This construction allows for more flexibility in constructing the schema as some columns will be checked for dtypes and others will not. This method is the only way to have mixed checks of column names and dtypes in [`col_schema_match()`](`pointblank.Validate.col_schema_match`). When providing a dictionary to `columns=`, the keys are the column names and the values are the dtypes. This method of input is useful in those cases where you might already have a dictionary of column names and dtypes that you want to use as the schema. If using individual column arguments in the form of keyword arguments, the column names are the keyword arguments and the dtypes are the values. This method emphasizes readability and is perhaps more convenient when manually constructing a schema with a small number of columns. Finally, multiple dtypes can be provided for a single column by providing a list or tuple of dtypes in place of a scalar string value. Having multiple dtypes for a column allows for the dtype check via [`col_schema_match()`](`pointblank.Validate.col_schema_match`) to make multiple attempts at matching the column dtype. Should any of the dtypes match the column dtype, that part of the schema check will pass. Here are some examples of how you could provide single and multiple dtypes for a column: ```python # list of tuples schema_1 = pb.Schema(columns=[("name", "String"), ("age", ["Float64", "Int64"])]) # dictionary schema_2 = pb.Schema(columns={"name": "String", "age": ["Float64", "Int64"]}) # keyword arguments schema_3 = pb.Schema(name="String", age=["Float64", "Int64"]) ``` All of the above examples will construct the same schema object. Examples -------- A schema can be constructed via the `Schema` class in multiple ways. Let's use the following Polars DataFrame as a basis for constructing a schema: ```python import pointblank as pb import polars as pl df = pl.DataFrame({ "name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35], "height": [5.6, 6.0, 5.8] }) ``` You could provide `Schema(columns=)` a list of tuples containing column names and data types: ```python schema = pb.Schema(columns=[("name", "String"), ("age", "Int64"), ("height", "Float64")]) ``` Alternatively, a dictionary containing column names and dtypes also works: ```python schema = pb.Schema(columns={"name": "String", "age": "Int64", "height": "Float64"}) ``` Another input method involves using individual column arguments in the form of keyword arguments: ```python schema = pb.Schema(name="String", age="Int64", height="Float64") ``` Finally, you could also provide a DataFrame (Polars or Pandas) or an Ibis table object to `tbl=` and the schema will be collected: ```python schema = pb.Schema(tbl=df) ``` Whichever method you choose, you can verify the schema inputs by printing the `schema` object: ```python print(schema) ``` The `Schema` object can be used to validate the structure of a table against the schema. The relevant `Validate` method for this is [`col_schema_match()`](`pointblank.Validate.col_schema_match`). In a validation workflow, you'll have a target table (defined at the beginning of the workflow) and you might want to ensure that your expectations of the table structure are met. The [`col_schema_match()`](`pointblank.Validate.col_schema_match`) method works with a `Schema` object to validate the structure of the table.
Here's an example of how you could use [`col_schema_match()`](`pointblank.Validate.col_schema_match`) in a validation workflow: ```python # Define the schema schema = pb.Schema(name="String", age="Int64", height="Float64") # Define a validation that checks the schema against the table (`df`) validation = ( pb.Validate(data=df) .col_schema_match(schema=schema) .interrogate() ) # Display the validation results validation ``` The [`col_schema_match()`](`pointblank.Validate.col_schema_match`) validation method will validate the structure of the table against the schema during interrogation. If the structure of the table does not match the schema, the single test unit will fail. In this case, the defined schema matched the structure of the table, so the validation passed. We can also choose to check only the column names of the target table. This can be done by providing a simplified `Schema` object, which is given a list of column names: ```python schema = pb.Schema(columns=["name", "age", "height"]) validation = ( pb.Validate(data=df) .col_schema_match(schema=schema) .interrogate() ) validation ``` In this case, the schema only checks the column names of the table against the schema during interrogation. If the column names of the table do not match the schema, the single test unit will fail. In this case, the defined schema matched the column names of the table, so the validation passed. If you wanted to check column names and dtypes only for a subset of columns (and just the column names for the rest), you could use a list of mixed one- or two-item tuples in `columns=`: ```python schema = pb.Schema(columns=[("name", "String"), ("age", ), ("height", )]) validation = ( pb.Validate(data=df) .col_schema_match(schema=schema) .interrogate() ) validation ``` Not specifying a dtype for a column (as is the case for the `age` and `height` columns in the above example) will only check the column name. There may also be the case where you want to check the column names and specify multiple dtypes for a column to have several attempts at matching the dtype. This can be done by providing a list of dtypes where there would normally be a single dtype: ```python schema = pb.Schema( columns=[("name", "String"), ("age", ["Float64", "Int64"]), ("height", "Float64")] ) validation = ( pb.Validate(data=df) .col_schema_match(schema=schema) .interrogate() ) validation ``` For the `age` column, the schema will check for both `Float64` and `Int64` dtypes. If either of these dtypes is found in the column, the portion of the schema check will succeed. See Also -------- The [`col_schema_match()`](`pointblank.Validate.col_schema_match`) validation method, where a `Schema` object is used in a validation workflow. DraftValidation(data: 'FrameT | Any', model: 'str', api_key: 'str | None' = None, verify_ssl: 'bool' = True) -> None Draft a validation plan for a given table using an LLM. By using a large language model (LLM) to draft a validation plan, you can quickly generate a starting point for validating a table. This can be useful when you have a new table and you want to get a sense of how to validate it (and adjustments could always be made later). The `DraftValidation` class uses the `chatlas` package to draft a validation plan for a given table using an LLM from either the `"anthropic"`, `"openai"`, `"ollama"` or `"bedrock"` provider. You can install all requirements for the class through an optional 'generate' install of Pointblank via `pip install pointblank[generate]`. 
:::{.callout-warning} The `DraftValidation` class is still experimental. Please report any issues you encounter in the [Pointblank issue tracker](https://github.com/posit-dev/pointblank/issues). ::: Parameters ---------- data The data to be used for drafting a validation plan. model The model to be used. This should be in the form of `provider:model` (e.g., `"anthropic:claude-sonnet-4-5"`). Supported providers are `"anthropic"`, `"openai"`, `"ollama"`, and `"bedrock"`. api_key The API key to be used for the model. verify_ssl Whether to verify SSL certificates when making requests to the LLM provider. Set to `False` to disable SSL verification (e.g., when behind a corporate firewall with self-signed certificates). Defaults to `True`. Use with caution as disabling SSL verification can pose security risks. Returns ------- str The drafted validation plan. Constructing the `model` Argument --------------------------------- The `model=` argument should be constructed using the provider and model name separated by a colon (`provider:model`). The provider text can be any of: - `"anthropic"` (Anthropic) - `"openai"` (OpenAI) - `"ollama"` (Ollama) - `"bedrock"` (Amazon Bedrock) The model name should be the specific model to be used from the provider. Model names are subject to change, so consult the provider's documentation for the most up-to-date model names. Notes on Authentication ----------------------- Providing a valid API key as a string in the `api_key` argument is adequate for getting started, but you should consider using a more secure method for handling API keys. One way to do this is to load the API key from an environment variable and retrieve it using the `os` module (specifically the `os.getenv()` function). Places to store the API key might include `.bashrc`, `.bash_profile`, `.zshrc`, or `.zsh_profile`. Another solution is to store one or more model provider API keys in an `.env` file (in the root of your project). If the API keys have correct names (e.g., `ANTHROPIC_API_KEY` or `OPENAI_API_KEY`) then DraftValidation will automatically load the API key from the `.env` file and there's no need to provide the `api_key` argument. An `.env` file might look like this: ```plaintext ANTHROPIC_API_KEY="your_anthropic_api_key_here" OPENAI_API_KEY="your_openai_api_key_here" ``` There's no need to have the `python-dotenv` package installed when using `.env` files in this way. Notes on SSL Certificate Verification -------------------------------------- By default, SSL certificate verification is enabled for all requests to LLM providers. However, in certain network environments (such as corporate networks with self-signed certificates or firewall proxies), you may encounter SSL certificate verification errors. To disable SSL verification, set the `verify_ssl` parameter to `False`: ```python import pointblank as pb data = pb.load_dataset(dataset="nycflights", tbl_type="duckdb") # Disable SSL verification for networks with self-signed certificates pb.DraftValidation( data=data, model="anthropic:claude-sonnet-4-5", verify_ssl=False ) ``` :::{.callout-warning} Disabling SSL verification (through `verify_ssl=False`) can expose your API keys and data to man-in-the-middle attacks. Only use this option in trusted network environments and when absolutely necessary. ::: Notes on Data Sent to the Model Provider ---------------------------------------- The data sent to the model provider is a JSON summary of the table. This data summary is generated internally by `DraftValidation` using the `DataScan` class.
The summary includes the following information: - the number of rows and columns in the table - the type of dataset (e.g., Polars, DuckDB, Pandas, etc.) - the column names and their types - column level statistics such as the number of missing values, min, max, mean, and median, etc. - a short list of data values in each column The JSON summary is used to provide the model with the necessary information to draft a validation plan. As such, even very large tables can be used with the `DraftValidation` class since the contents of the table are not sent to the model provider. The Amazon Bedrock is a special case since it is a self-hosted model and security controls are in place to ensure that data is kept within the user's AWS environment. If using an Ollama model all data is handled locally, though only a few models are capable enough to perform the task of drafting a validation plan. Examples -------- Let's look at how the `DraftValidation` class can be used to draft a validation plan for a table. The table to be used is `"nycflights"`, which is available here via the [`load_dataset()`](`pointblank.load_dataset`) function. The model to be used is `"anthropic:claude-sonnet-4-5"` (which performs very well compared to other LLMs). The example assumes that the API key is stored in an `.env` file as `ANTHROPIC_API_KEY`. ```python import pointblank as pb # Load the "nycflights" dataset as a DuckDB table data = pb.load_dataset(dataset="nycflights", tbl_type="duckdb") # Draft a validation plan for the "nycflights" table pb.DraftValidation(data=data, model="anthropic:claude-sonnet-4-5") ``` The output will be a drafted validation plan for the `"nycflights"` table and this will appear in the console. ````plaintext ```python import pointblank as pb # Define schema based on column names and dtypes schema = pb.Schema(columns=[ ("year", "int64"), ("month", "int64"), ("day", "int64"), ("dep_time", "int64"), ("sched_dep_time", "int64"), ("dep_delay", "int64"), ("arr_time", "int64"), ("sched_arr_time", "int64"), ("arr_delay", "int64"), ("carrier", "string"), ("flight", "int64"), ("tailnum", "string"), ("origin", "string"), ("dest", "string"), ("air_time", "int64"), ("distance", "int64"), ("hour", "int64"), ("minute", "int64") ]) # The validation plan validation = ( pb.Validate( data=your_data, # Replace your_data with the actual data variable label="Draft Validation", thresholds=pb.Thresholds(warning=0.10, error=0.25, critical=0.35) ) .col_schema_match(schema=schema) .col_vals_not_null(columns=[ "year", "month", "day", "sched_dep_time", "carrier", "flight", "origin", "dest", "distance", "hour", "minute" ]) .col_vals_between(columns="month", left=1, right=12) .col_vals_between(columns="day", left=1, right=31) .col_vals_between(columns="sched_dep_time", left=106, right=2359) .col_vals_between(columns="dep_delay", left=-43, right=1301, na_pass=True) .col_vals_between(columns="air_time", left=20, right=695, na_pass=True) .col_vals_between(columns="distance", left=17, right=4983) .col_vals_between(columns="hour", left=1, right=23) .col_vals_between(columns="minute", left=0, right=59) .col_vals_in_set(columns="origin", set=["EWR", "LGA", "JFK"]) .col_count_match(count=18) .row_count_match(count=336776) .rows_distinct() .interrogate() ) validation ``` ```` The drafted validation plan can be copied and pasted into a Python script or notebook for further use. In other words, the generated plan can be adjusted as needed to suit the specific requirements of the table being validated. 
Note that the output does not know how the data was obtained, so it uses the placeholder `your_data` in the `data=` argument of the `Validate` class. When adapted for use, this should be replaced with the actual data variable. ## The Validation Steps family Validation steps can be thought of as sequential validations on the target data. We call `Validate`'s validation methods to build up a validation plan: a collection of steps that, in the aggregate, provides good validation coverage. col_vals_gt(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', value: 'float | int | Column', na_pass: 'bool' = False, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate' Are column data greater than a fixed value or data in another column? The `col_vals_gt()` validation method checks whether column values in a table are *greater than* a specified `value=` (the exact comparison used in this function is `col_val > value`). The `value=` can be specified as a single, literal value or as a column name given in [`col()`](`pointblank.col`). This validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied). Parameters ---------- columns A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column. value The value to compare against. This can be a single value or a single column name given in [`col()`](`pointblank.col`). The latter option allows for a column-to-column comparison. For more information on which types of values are allowed, see the *What Can Be Used in `value=`?* section. na_pass Should any encountered None, NA, or Null values be considered as passing test units? By default, this is `False`. Set to `True` to pass test units with missing values. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument. segments An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information. thresholds Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step(s) meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table.
You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value indicating whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). Returns ------- Validate The `Validate` object with the added validation step. What Can Be Used in `value=`? ----------------------------- The `value=` argument allows for a variety of input types. The most common are: - a single numeric value - a single date or datetime value - A [`col()`](`pointblank.col`) object that represents a column name When supplying a number as the basis of comparison, keep in mind that all resolved columns must also be numeric. Should you have columns that are of the date or datetime types, you can supply a date or datetime value as the `value=` argument. There is flexibility in how you provide the date or datetime value, as it can be: - a string-based date or datetime (e.g., `"2023-10-01"`, `"2023-10-01 13:45:30"`, etc.) - a date or datetime object using the `datetime` module (e.g., `datetime.date(2023, 10, 1)`, `datetime.datetime(2023, 10, 1, 13, 45, 30)`, etc.) Finally, when supplying a column name in the `value=` argument, it must be specified within [`col()`](`pointblank.col`). This is a column-to-column comparison and, crucially, the columns being compared must be of the same type (e.g., both numeric, both date, etc.). Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Note that you can refer to columns via `columns=` and `value=col(...)` that are expected to be present in the transformed table, but may not exist in the table before preprocessing. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Segmentation ------------ The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column. Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region. Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. Any other values in the column will be disregarded (i.e., no validation steps will be created for them). 
A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. The following inputs are both valid: ``` # Segments from all unique values in the `region` column # and specific dates in the `date` column segments=["region", ("date", ["2023-01-01", "2023-01-02"])] # Segments from all unique values in the `region` and `date` columns segments=["region", "date"] ``` The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments. Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion failing of all test units (a value between `0` to `1`), or, the absolute number of failing test units (as integer that's `1` or greater). Thresholds can be defined using one of these input schemes: 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds) 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level 3. create a dictionary of 1-3 value entries; the valid keys: are 'warning', 'error', and 'critical' 4. a single integer/float value denoting absolute number or fraction of failing test units for the 'warning' level only If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. All of the threshold levels don't need to be set, you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). Examples -------- For the examples here, we'll use a simple Polars DataFrame with three numeric columns (`a`, `b`, and `c`). The table is shown below: ```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [5, 6, 5, 7, 6, 5], "b": [1, 2, 1, 2, 2, 2], "c": [2, 1, 2, 2, 3, 4], } ) pb.preview(tbl) ``` Let's validate that values in column `a` are all greater than the value of `4`. We'll determine if this validation had any failing test units (there are six test units, one for each row). ```python validation = ( pb.Validate(data=tbl) .col_vals_gt(columns="a", value=4) .interrogate() ) validation ``` Printing the `validation` object shows the validation table in an HTML viewing environment. The validation table shows the single entry that corresponds to the validation step created by using `col_vals_gt()`. All test units passed, and there are no failing test units. 
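Step-level thresholds (using any of the input schemes listed above) can be attached directly to this kind of step. Here is a minimal sketch that reuses the same `tbl` but raises the comparison value to `5` so that some test units fail; the tuple `(1, 2)` sets an absolute 'warning' level of one failing test unit and an 'error' level of two.

```python
validation = (
    pb.Validate(data=tbl)
    .col_vals_gt(
        columns="a",
        value=5,            # rows where `a` is exactly 5 will now fail
        thresholds=(1, 2),  # 'warning' at 1 failing test unit, 'error' at 2
    )
    .interrogate()
)

validation
```

With three failing test units in column `a`, both the 'warning' and 'error' levels would be reached, and the validation table would flag the step accordingly.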
Aside from checking a column against a literal value, we can also use a column name in the `value=` argument (with the helper function [`col()`](`pointblank.col`) to perform a column-to-column comparison. For the next example, we'll use `col_vals_gt()` to check whether the values in column `c` are greater than values in column `b`. ```python validation = ( pb.Validate(data=tbl) .col_vals_gt(columns="c", value=pb.col("b")) .interrogate() ) validation ``` The validation table reports two failing test units. The specific failing cases are: - Row 1: `c` is `1` and `b` is `2`. - Row 3: `c` is `2` and `b` is `2`. col_vals_lt(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', value: 'float | int | Column', na_pass: 'bool' = False, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate' Are column data less than a fixed value or data in another column? The `col_vals_lt()` validation method checks whether column values in a table are *less than* a specified `value=` (the exact comparison used in this function is `col_val < value`). The `value=` can be specified as a single, literal value or as a column name given in [`col()`](`pointblank.col`). This validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied). Parameters ---------- columns A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column. value The value to compare against. This can be a single value or a single column name given in [`col()`](`pointblank.col`). The latter option allows for a column-to-column comparison. For more information on which types of values are allowed, see the *What Can Be Used in `value=`?* section. na_pass Should any encountered None, NA, or Null values be considered as passing test units? By default, this is `False`. Set to `True` to pass test units with missing values. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument. segments An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information. thresholds Set threshold failure levels for reporting and reacting to exceedences of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step(s) meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. 
brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value indicating whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). Returns ------- Validate The `Validate` object with the added validation step. What Can Be Used in `value=`? ----------------------------- The `value=` argument allows for a variety of input types. The most common are: - a single numeric value - a single date or datetime value - A [`col()`](`pointblank.col`) object that represents a column name When supplying a number as the basis of comparison, keep in mind that all resolved columns must also be numeric. Should you have columns that are of the date or datetime types, you can supply a date or datetime value as the `value=` argument. There is flexibility in how you provide the date or datetime value, as it can be: - a string-based date or datetime (e.g., `"2023-10-01"`, `"2023-10-01 13:45:30"`, etc.) - a date or datetime object using the `datetime` module (e.g., `datetime.date(2023, 10, 1)`, `datetime.datetime(2023, 10, 1, 13, 45, 30)`, etc.) Finally, when supplying a column name in the `value=` argument, it must be specified within [`col()`](`pointblank.col`). This is a column-to-column comparison and, crucially, the columns being compared must be of the same type (e.g., both numeric, both date, etc.). Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Note that you can refer to columns via `columns=` and `value=col(...)` that are expected to be present in the transformed table, but may not exist in the table before preprocessing. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Segmentation ------------ The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column. Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region. Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. 
Any other values in the column will be disregarded (i.e., no validation steps will be created for them). A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. The following inputs are both valid: ``` # Segments from all unique values in the `region` column # and specific dates in the `date` column segments=["region", ("date", ["2023-01-01", "2023-01-02"])] # Segments from all unique values in the `region` and `date` columns segments=["region", "date"] ``` The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments. Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion failing of all test units (a value between `0` to `1`), or, the absolute number of failing test units (as integer that's `1` or greater). Thresholds can be defined using one of these input schemes: 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds) 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level 3. create a dictionary of 1-3 value entries; the valid keys: are 'warning', 'error', and 'critical' 4. a single integer/float value denoting absolute number or fraction of failing test units for the 'warning' level only If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. All of the threshold levels don't need to be set, you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). Examples -------- For the examples here, we'll use a simple Polars DataFrame with three numeric columns (`a`, `b`, and `c`). The table is shown below: ```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [5, 6, 5, 9, 7, 5], "b": [1, 2, 1, 2, 2, 2], "c": [2, 1, 1, 4, 3, 4], } ) pb.preview(tbl) ``` Let's validate that values in column `a` are all less than the value of `10`. We'll determine if this validation had any failing test units (there are six test units, one for each row). ```python validation = ( pb.Validate(data=tbl) .col_vals_lt(columns="a", value=10) .interrogate() ) validation ``` Printing the `validation` object shows the validation table in an HTML viewing environment. The validation table shows the single entry that corresponds to the validation step created by using `col_vals_lt()`. All test units passed, and there are no failing test units. 
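The `na_pass=` argument can also be illustrated here. As a small sketch, suppose a variant of the table (called `tbl_na` below, defined purely for illustration) has a missing value in column `a`; setting `na_pass=True` treats that missing value as a passing test unit rather than a failing one.

```python
import polars as pl

import pointblank as pb

# A variant of the table with a missing value in column `a` (for illustration only)
tbl_na = pl.DataFrame({"a": [5, 6, None, 9, 7, 5]})

validation = (
    pb.Validate(data=tbl_na)
    .col_vals_lt(columns="a", value=10, na_pass=True)  # the missing value counts as passing
    .interrogate()
)

validation
```

With the default of `na_pass=False`, the missing value would instead be counted as a failing test unit.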
Aside from checking a column against a literal value, we can also use a column name in the `value=` argument (with the helper function [`col()`](`pointblank.col`) to perform a column-to-column comparison. For the next example, we'll use `col_vals_lt()` to check whether the values in column `b` are less than values in column `c`. ```python validation = ( pb.Validate(data=tbl) .col_vals_lt(columns="b", value=pb.col("c")) .interrogate() ) validation ``` The validation table reports two failing test units. The specific failing cases are: - Row 1: `b` is `2` and `c` is `1`. - Row 2: `b` is `1` and `c` is `1`. col_vals_ge(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', value: 'float | int | Column', na_pass: 'bool' = False, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate' Are column data greater than or equal to a fixed value or data in another column? The `col_vals_ge()` validation method checks whether column values in a table are *greater than or equal to* a specified `value=` (the exact comparison used in this function is `col_val >= value`). The `value=` can be specified as a single, literal value or as a column name given in [`col()`](`pointblank.col`). This validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied). Parameters ---------- columns A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column. value The value to compare against. This can be a single value or a single column name given in [`col()`](`pointblank.col`). The latter option allows for a column-to-column comparison. For more information on which types of values are allowed, see the *What Can Be Used in `value=`?* section. na_pass Should any encountered None, NA, or Null values be considered as passing test units? By default, this is `False`. Set to `True` to pass test units with missing values. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument. segments An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information. thresholds Set threshold failure levels for reporting and reacting to exceedences of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step(s) meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. 
brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value indicating whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). Returns ------- Validate The `Validate` object with the added validation step. What Can Be Used in `value=`? ----------------------------- The `value=` argument allows for a variety of input types. The most common are: - a single numeric value - a single date or datetime value - A [`col()`](`pointblank.col`) object that represents a column name When supplying a number as the basis of comparison, keep in mind that all resolved columns must also be numeric. Should you have columns that are of the date or datetime types, you can supply a date or datetime value as the `value=` argument. There is flexibility in how you provide the date or datetime value, as it can be: - a string-based date or datetime (e.g., `"2023-10-01"`, `"2023-10-01 13:45:30"`, etc.) - a date or datetime object using the `datetime` module (e.g., `datetime.date(2023, 10, 1)`, `datetime.datetime(2023, 10, 1, 13, 45, 30)`, etc.) Finally, when supplying a column name in the `value=` argument, it must be specified within [`col()`](`pointblank.col`). This is a column-to-column comparison and, crucially, the columns being compared must be of the same type (e.g., both numeric, both date, etc.). Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Note that you can refer to columns via `columns=` and `value=col(...)` that are expected to be present in the transformed table, but may not exist in the table before preprocessing. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Segmentation ------------ The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column. Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region. Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. 
Any other values in the column will be disregarded (i.e., no validation steps will be created for them). A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. The following inputs are both valid: ``` # Segments from all unique values in the `region` column # and specific dates in the `date` column segments=["region", ("date", ["2023-01-01", "2023-01-02"])] # Segments from all unique values in the `region` and `date` columns segments=["region", "date"] ``` The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments. Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion failing of all test units (a value between `0` to `1`), or, the absolute number of failing test units (as integer that's `1` or greater). Thresholds can be defined using one of these input schemes: 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds) 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level 3. create a dictionary of 1-3 value entries; the valid keys: are 'warning', 'error', and 'critical' 4. a single integer/float value denoting absolute number or fraction of failing test units for the 'warning' level only If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. All of the threshold levels don't need to be set, you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). Examples -------- For the examples here, we'll use a simple Polars DataFrame with three numeric columns (`a`, `b`, and `c`). The table is shown below: ```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [5, 6, 5, 9, 7, 5], "b": [5, 3, 1, 8, 2, 3], "c": [2, 3, 1, 4, 3, 4], } ) pb.preview(tbl) ``` Let's validate that values in column `a` are all greater than or equal to the value of `5`. We'll determine if this validation had any failing test units (there are six test units, one for each row). ```python validation = ( pb.Validate(data=tbl) .col_vals_ge(columns="a", value=5) .interrogate() ) validation ``` Printing the `validation` object shows the validation table in an HTML viewing environment. The validation table shows the single entry that corresponds to the validation step created by using `col_vals_ge()`. All test units passed, and there are no failing test units. 
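Because `columns=` accepts a list, the same comparison can be applied to several columns at once, with each resolved column becoming its own validation step. A quick sketch using the same `tbl`:

```python
validation = (
    pb.Validate(data=tbl)
    .col_vals_ge(columns=["a", "b"], value=1)  # one validation step per column
    .interrogate()
)

validation
```

This produces two sequentially numbered steps (one for column `a`, one for column `b`) and, in this case, every test unit in both columns passes.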
Aside from checking a column against a literal value, we can also use a column name in the `value=` argument (with the helper function [`col()`](`pointblank.col`)) to perform a column-to-column comparison. For the next example, we'll use `col_vals_ge()` to check whether the values in column `b` are greater than or equal to the values in column `c`.

```python
validation = (
    pb.Validate(data=tbl)
    .col_vals_ge(columns="b", value=pb.col("c"))
    .interrogate()
)

validation
```

The validation table reports two failing test units. The specific failing cases are:

- Row 4: `b` is `2` and `c` is `3`.
- Row 5: `b` is `3` and `c` is `4`.

col_vals_le(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', value: 'float | int | Column', na_pass: 'bool' = False, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate' Are column data less than or equal to a fixed value or data in another column? The `col_vals_le()` validation method checks whether column values in a table are *less than or equal to* a specified `value=` (the exact comparison used in this function is `col_val <= value`). The `value=` can be specified as a single, literal value or as a column name given in [`col()`](`pointblank.col`). This validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied). Parameters ---------- columns A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column. value The value to compare against. This can be a single value or a single column name given in [`col()`](`pointblank.col`). The latter option allows for a column-to-column comparison. For more information on which types of values are allowed, see the *What Can Be Used in `value=`?* section. na_pass Should any encountered None, NA, or Null values be considered as passing test units? By default, this is `False`. Set to `True` to pass test units with missing values. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument. segments An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information. thresholds Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step(s) meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions.
brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value indicating whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). Returns ------- Validate The `Validate` object with the added validation step. What Can Be Used in `value=`? ----------------------------- The `value=` argument allows for a variety of input types. The most common are: - a single numeric value - a single date or datetime value - A [`col()`](`pointblank.col`) object that represents a column name When supplying a number as the basis of comparison, keep in mind that all resolved columns must also be numeric. Should you have columns that are of the date or datetime types, you can supply a date or datetime value as the `value=` argument. There is flexibility in how you provide the date or datetime value, as it can be: - a string-based date or datetime (e.g., `"2023-10-01"`, `"2023-10-01 13:45:30"`, etc.) - a date or datetime object using the `datetime` module (e.g., `datetime.date(2023, 10, 1)`, `datetime.datetime(2023, 10, 1, 13, 45, 30)`, etc.) Finally, when supplying a column name in the `value=` argument, it must be specified within [`col()`](`pointblank.col`). This is a column-to-column comparison and, crucially, the columns being compared must be of the same type (e.g., both numeric, both date, etc.). Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Note that you can refer to columns via `columns=` and `value=col(...)` that are expected to be present in the transformed table, but may not exist in the table before preprocessing. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Segmentation ------------ The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column. Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region. Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. 
Any other values in the column will be disregarded (i.e., no validation steps will be created for them). A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. The following inputs are both valid: ``` # Segments from all unique values in the `region` column # and specific dates in the `date` column segments=["region", ("date", ["2023-01-01", "2023-01-02"])] # Segments from all unique values in the `region` and `date` columns segments=["region", "date"] ``` The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments. Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion failing of all test units (a value between `0` to `1`), or, the absolute number of failing test units (as integer that's `1` or greater). Thresholds can be defined using one of these input schemes: 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds) 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level 3. create a dictionary of 1-3 value entries; the valid keys: are 'warning', 'error', and 'critical' 4. a single integer/float value denoting absolute number or fraction of failing test units for the 'warning' level only If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. All of the threshold levels don't need to be set, you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). Examples -------- For the examples here, we'll use a simple Polars DataFrame with three numeric columns (`a`, `b`, and `c`). The table is shown below: ```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [5, 6, 5, 9, 7, 5], "b": [1, 3, 1, 5, 2, 5], "c": [2, 1, 1, 4, 3, 4], } ) pb.preview(tbl) ``` Let's validate that values in column `a` are all less than or equal to the value of `9`. We'll determine if this validation had any failing test units (there are six test units, one for each row). ```python validation = ( pb.Validate(data=tbl) .col_vals_le(columns="a", value=9) .interrogate() ) validation ``` Printing the `validation` object shows the validation table in an HTML viewing environment. The validation table shows the single entry that corresponds to the validation step created by using `col_vals_le()`. All test units passed, and there are no failing test units. 
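The `pre=` argument can be sketched with the same table. In the example below, a derived column (named `b_plus_c` here purely for illustration) is created during interrogation and then validated; the derived column only needs to exist in the transformed table, not in `tbl` itself.

```python
import polars as pl

validation = (
    pb.Validate(data=tbl)
    .col_vals_le(
        columns="b_plus_c",  # column created by `pre=`, not present in the original table
        value=9,
        pre=lambda df: df.with_columns((pl.col("b") + pl.col("c")).alias("b_plus_c")),
    )
    .interrogate()
)

validation
```

The transformed table exists only for the duration of this step. Since the sums of `b` and `c` never exceed `9` here, all test units pass.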
Aside from checking a column against a literal value, we can also use a column name in the `value=` argument (with the helper function [`col()`](`pointblank.col`) to perform a column-to-column comparison. For the next example, we'll use `col_vals_le()` to check whether the values in column `c` are less than values in column `b`. ```python validation = ( pb.Validate(data=tbl) .col_vals_le(columns="c", value=pb.col("b")) .interrogate() ) validation ``` The validation table reports two failing test units. The specific failing cases are: - Row 0: `c` is `2` and `b` is `1`. - Row 4: `c` is `3` and `b` is `2`. col_vals_eq(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', value: 'float | int | Column', na_pass: 'bool' = False, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate' Are column data equal to a fixed value or data in another column? The `col_vals_eq()` validation method checks whether column values in a table are *equal to* a specified `value=` (the exact comparison used in this function is `col_val == value`). The `value=` can be specified as a single, literal value or as a column name given in [`col()`](`pointblank.col`). This validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied). Parameters ---------- columns A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column. value The value to compare against. This can be a single value or a single column name given in [`col()`](`pointblank.col`). The latter option allows for a column-to-column comparison. For more information on which types of values are allowed, see the *What Can Be Used in `value=`?* section. na_pass Should any encountered None, NA, or Null values be considered as passing test units? By default, this is `False`. Set to `True` to pass test units with missing values. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument. segments An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information. thresholds Set threshold failure levels for reporting and reacting to exceedences of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step(s) meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. 
brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value indicating whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). Returns ------- Validate The `Validate` object with the added validation step. What Can Be Used in `value=`? ----------------------------- The `value=` argument allows for a variety of input types. The most common are: - a single numeric value - a single date or datetime value - A [`col()`](`pointblank.col`) object that represents a column name When supplying a number as the basis of comparison, keep in mind that all resolved columns must also be numeric. Should you have columns that are of the date or datetime types, you can supply a date or datetime value as the `value=` argument. There is flexibility in how you provide the date or datetime value, as it can be: - a string-based date or datetime (e.g., `"2023-10-01"`, `"2023-10-01 13:45:30"`, etc.) - a date or datetime object using the `datetime` module (e.g., `datetime.date(2023, 10, 1)`, `datetime.datetime(2023, 10, 1, 13, 45, 30)`, etc.) Finally, when supplying a column name in the `value=` argument, it must be specified within [`col()`](`pointblank.col`). This is a column-to-column comparison and, crucially, the columns being compared must be of the same type (e.g., both numeric, both date, etc.). Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Note that you can refer to columns via `columns=` and `value=col(...)` that are expected to be present in the transformed table, but may not exist in the table before preprocessing. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Segmentation ------------ The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column. Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region. Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. 
Any other values in the column will be disregarded (i.e., no validation steps will be created for them). A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. The following inputs are both valid: ``` # Segments from all unique values in the `region` column # and specific dates in the `date` column segments=["region", ("date", ["2023-01-01", "2023-01-02"])] # Segments from all unique values in the `region` and `date` columns segments=["region", "date"] ``` The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments. Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion failing of all test units (a value between `0` to `1`), or, the absolute number of failing test units (as integer that's `1` or greater). Thresholds can be defined using one of these input schemes: 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds) 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level 3. create a dictionary of 1-3 value entries; the valid keys: are 'warning', 'error', and 'critical' 4. a single integer/float value denoting absolute number or fraction of failing test units for the 'warning' level only If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. All of the threshold levels don't need to be set, you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). Examples -------- For the examples here, we'll use a simple Polars DataFrame with two numeric columns (`a` and `b`). The table is shown below: ```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [5, 5, 5, 5, 5, 5], "b": [5, 5, 5, 6, 5, 4], } ) pb.preview(tbl) ``` Let's validate that values in column `a` are all equal to the value of `5`. We'll determine if this validation had any failing test units (there are six test units, one for each row). ```python validation = ( pb.Validate(data=tbl) .col_vals_eq(columns="a", value=5) .interrogate() ) validation ``` Printing the `validation` object shows the validation table in an HTML viewing environment. The validation table shows the single entry that corresponds to the validation step created by using `col_vals_eq()`. All test units passed, and there are no failing test units. 
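A step-level brief can also be attached here. The sketch below uses the `"{step}"` templating element described above so that the step number is inserted into the description shown in the reporting table.

```python
validation = (
    pb.Validate(data=tbl)
    .col_vals_eq(
        columns="a",
        value=5,
        brief="Step {step}: all values in `a` must equal 5",  # illustrative brief text
    )
    .interrogate()
)

validation
```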
Aside from checking a column against a literal value, we can also use a column name in the `value=` argument (with the helper function [`col()`](`pointblank.col`) to perform a column-to-column comparison. For the next example, we'll use `col_vals_eq()` to check whether the values in column `a` are equal to the values in column `b`. ```python validation = ( pb.Validate(data=tbl) .col_vals_eq(columns="a", value=pb.col("b")) .interrogate() ) validation ``` The validation table reports two failing test units. The specific failing cases are: - Row 3: `a` is `5` and `b` is `6`. - Row 5: `a` is `5` and `b` is `4`. col_vals_ne(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', value: 'float | int | Column', na_pass: 'bool' = False, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate' Are column data not equal to a fixed value or data in another column? The `col_vals_ne()` validation method checks whether column values in a table are *not equal to* a specified `value=` (the exact comparison used in this function is `col_val != value`). The `value=` can be specified as a single, literal value or as a column name given in [`col()`](`pointblank.col`). This validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied). Parameters ---------- columns A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column. value The value to compare against. This can be a single value or a single column name given in [`col()`](`pointblank.col`). The latter option allows for a column-to-column comparison. For more information on which types of values are allowed, see the *What Can Be Used in `value=`?* section. na_pass Should any encountered None, NA, or Null values be considered as passing test units? By default, this is `False`. Set to `True` to pass test units with missing values. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument. segments An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information. thresholds Set threshold failure levels for reporting and reacting to exceedences of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step(s) meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. 
brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value indicating whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). Returns ------- Validate The `Validate` object with the added validation step. What Can Be Used in `value=`? ----------------------------- The `value=` argument allows for a variety of input types. The most common are: - a single numeric value - a single date or datetime value - A [`col()`](`pointblank.col`) object that represents a column name When supplying a number as the basis of comparison, keep in mind that all resolved columns must also be numeric. Should you have columns that are of the date or datetime types, you can supply a date or datetime value as the `value=` argument. There is flexibility in how you provide the date or datetime value, as it can be: - a string-based date or datetime (e.g., `"2023-10-01"`, `"2023-10-01 13:45:30"`, etc.) - a date or datetime object using the `datetime` module (e.g., `datetime.date(2023, 10, 1)`, `datetime.datetime(2023, 10, 1, 13, 45, 30)`, etc.) Finally, when supplying a column name in the `value=` argument, it must be specified within [`col()`](`pointblank.col`). This is a column-to-column comparison and, crucially, the columns being compared must be of the same type (e.g., both numeric, both date, etc.). Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Note that you can refer to columns via `columns=` and `value=col(...)` that are expected to be present in the transformed table, but may not exist in the table before preprocessing. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Segmentation ------------ The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column. Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region. Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. 
Any other values in the column will be disregarded (i.e., no validation steps will be created for them). A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. The following inputs are both valid: ``` # Segments from all unique values in the `region` column # and specific dates in the `date` column segments=["region", ("date", ["2023-01-01", "2023-01-02"])] # Segments from all unique values in the `region` and `date` columns segments=["region", "date"] ``` The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments. Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion failing of all test units (a value between `0` to `1`), or, the absolute number of failing test units (as integer that's `1` or greater). Thresholds can be defined using one of these input schemes: 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds) 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level 3. create a dictionary of 1-3 value entries; the valid keys: are 'warning', 'error', and 'critical' 4. a single integer/float value denoting absolute number or fraction of failing test units for the 'warning' level only If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. All of the threshold levels don't need to be set, you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). Examples -------- For the examples here, we'll use a simple Polars DataFrame with two numeric columns (`a` and `b`). The table is shown below: ```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [5, 5, 5, 5, 5, 5], "b": [5, 6, 3, 6, 5, 8], } ) pb.preview(tbl) ``` Let's validate that values in column `a` are not equal to the value of `3`. We'll determine if this validation had any failing test units (there are six test units, one for each row). ```python validation = ( pb.Validate(data=tbl) .col_vals_ne(columns="a", value=3) .interrogate() ) validation ``` Printing the `validation` object shows the validation table in an HTML viewing environment. The validation table shows the single entry that corresponds to the validation step created by using `col_vals_ne()`. All test units passed, and there are no failing test units. 
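The `segments=` argument can be sketched with a variant of the table that carries a grouping column (`tbl_seg` below is defined purely for illustration). Segmenting on `"region"` yields one validation step per unique region value.

```python
import polars as pl

import pointblank as pb

# A variant of the table with a grouping column (for illustration only)
tbl_seg = pl.DataFrame(
    {
        "region": ["North", "North", "South", "South", "East", "East"],
        "a": [5, 5, 5, 5, 5, 5],
    }
)

validation = (
    pb.Validate(data=tbl_seg)
    .col_vals_ne(columns="a", value=3, segments="region")  # one step per region
    .interrogate()
)

validation
```

Here the plan yields three steps (one each for `"North"`, `"South"`, and `"East"`), each with two test units, and all of them pass.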
Aside from checking a column against a literal value, we can also use a column name in the `value=` argument (with the helper function [`col()`](`pointblank.col`)) to perform a column-to-column comparison. For the next example, we'll use `col_vals_ne()` to check whether the values in column `a` aren't equal to the values in column `b`.

```python
validation = (
    pb.Validate(data=tbl)
    .col_vals_ne(columns="a", value=pb.col("b"))
    .interrogate()
)

validation
```

The validation table reports two failing test units. The specific failing cases are in rows 0 and 4, where `a` is `5` and `b` is `5` in both cases (i.e., they are equal to each other).

col_vals_between(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', left: 'float | int | Column', right: 'float | int | Column', inclusive: 'tuple[bool, bool]' = (True, True), na_pass: 'bool' = False, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'

Do column data lie between two specified values or data in other columns?

The `col_vals_between()` validation method checks whether column values in a table fall within a range. The range is specified with three arguments: `left=`, `right=`, and `inclusive=`. The `left=` and `right=` values specify the lower and upper bounds. These bounds can be specified as literal values or as column names provided within [`col()`](`pointblank.col`). The validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied).

Parameters
----------
columns
A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column.
left
The lower bound of the range. This can be a single value or a single column name given in [`col()`](`pointblank.col`). The latter option allows for a column-to-column comparison for this bound. See the *What Can Be Used in `left=` and `right=`?* section for details on this.
right
The upper bound of the range. This can be a single value or a single column name given in [`col()`](`pointblank.col`). The latter option allows for a column-to-column comparison for this bound. See the *What Can Be Used in `left=` and `right=`?* section for details on this.
inclusive
A tuple of two boolean values indicating whether the comparison should be inclusive. The positions of the boolean values correspond to the `left=` and `right=` values, respectively. By default, both values are `True`.
na_pass
Should any encountered None, NA, or Null values be considered as passing test units? By default, this is `False`. Set to `True` to pass test units with missing values.
pre
An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument.
segments
An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list).
Read the *Segmentation* section for usage information. thresholds Set threshold failure levels for reporting and reacting to exceedences of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step(s) meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value indicating whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). Returns ------- Validate The `Validate` object with the added validation step. What Can Be Used in `left=` and `right=`? ----------------------------------------- The `left=` and `right=` arguments both allow for a variety of input types. The most common are: - a single numeric value - a single date or datetime value - A [`col()`](`pointblank.col`) object that represents a column in the target table When supplying a number as the basis of comparison, keep in mind that all resolved columns must also be numeric. Should you have columns that are of the date or datetime types, you can supply a date or datetime value within `left=` and `right=`. There is flexibility in how you provide the date or datetime values for the bounds; they can be: - string-based dates or datetimes (e.g., `"2023-10-01"`, `"2023-10-01 13:45:30"`, etc.) - date or datetime objects using the `datetime` module (e.g., `datetime.date(2023, 10, 1)`, `datetime.datetime(2023, 10, 1, 13, 45, 30)`, etc.) Finally, when supplying a column name in either `left=` or `right=` (or both), it must be specified within [`col()`](`pointblank.col`). This facilitates column-to-column comparisons and, crucially, the columns being compared to either/both of the bounds must be of the same type as the column data (e.g., all numeric, all dates, etc.). Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Note that you can refer to columns via `columns=` and `left=col(...)`/`right=col(...)` that are expected to be present in the transformed table, but may not exist in the table before preprocessing. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. 
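To make the `pre=` behavior concrete, here is a minimal sketch (not from the original examples) of a preprocessing lambda used with `col_vals_between()`. The `temp_f` column and the Celsius conversion are hypothetical; the point is that the `columns=` value may refer to a column that only exists in the transformed table.

```python
import pointblank as pb
import polars as pl

# Hypothetical data: temperatures recorded in Fahrenheit
tbl_temps = pl.DataFrame({"temp_f": [68.0, 72.5, 75.2, 80.6]})

validation = (
    pb.Validate(data=tbl_temps)
    .col_vals_between(
        columns="temp_c",  # exists only after preprocessing
        left=20,
        right=30,
        # Derive a Celsius column during interrogation; the transformed
        # table is used for this step only and is not stored.
        pre=lambda df: df.with_columns(
            ((pl.col("temp_f") - 32) * 5 / 9).alias("temp_c")
        ),
    )
    .interrogate()
)

validation
```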
Segmentation ------------ The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column. Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region. Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. Any other values in the column will be disregarded (i.e., no validation steps will be created for them). A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. The following inputs are both valid: ``` # Segments from all unique values in the `region` column # and specific dates in the `date` column segments=["region", ("date", ["2023-01-01", "2023-01-02"])] # Segments from all unique values in the `region` and `date` columns segments=["region", "date"] ``` The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments. Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion failing of all test units (a value between `0` to `1`), or, the absolute number of failing test units (as integer that's `1` or greater). Thresholds can be defined using one of these input schemes: 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds) 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level 3. create a dictionary of 1-3 value entries; the valid keys: are 'warning', 'error', and 'critical' 4. a single integer/float value denoting absolute number or fraction of failing test units for the 'warning' level only If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. All of the threshold levels don't need to be set, you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). 
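To illustrate the four input schemes listed above, here is a brief sketch; the particular levels are arbitrary and chosen only for demonstration.

```python
import pointblank as pb

# 1. The Thresholds class (the most direct way to create thresholds)
thresholds = pb.Thresholds(warning=0.05, error=0.10, critical=0.15)

# 2. A tuple of 1-3 values: positions are (warning, error, critical)
thresholds = (0.05, 0.10, 0.15)

# 3. A dictionary with any of the keys 'warning', 'error', and 'critical'
thresholds = {"warning": 0.05, "critical": 0.15}

# 4. A single value sets the 'warning' level only (here, 5 failing test units)
thresholds = 5
```

Any of these forms can then be passed to `thresholds=` at the step level (or to `Validate(thresholds=...)` at the global level).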
Examples
--------
For the examples here, we'll use a simple Polars DataFrame with three numeric columns (`a`, `b`, and `c`). The table is shown below:

```python
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [2, 3, 2, 4, 3, 4],
        "b": [5, 6, 1, 6, 8, 5],
        "c": [9, 8, 8, 7, 7, 8],
    }
)

pb.preview(tbl)
```

Let's validate that values in column `a` are all between the fixed boundary values of `1` and `5`. We'll determine if this validation had any failing test units (there are six test units, one for each row).

```python
validation = (
    pb.Validate(data=tbl)
    .col_vals_between(columns="a", left=1, right=5)
    .interrogate()
)

validation
```

Printing the `validation` object shows the validation table in an HTML viewing environment. The validation table shows the single entry that corresponds to the validation step created by using `col_vals_between()`. All test units passed, and there are no failing test units.

Aside from checking a column against two literal values representing the lower and upper bounds, we can also provide column names to the `left=` and/or `right=` arguments (by using the helper function [`col()`](`pointblank.col`)). In this way, we can perform three additional comparison types:

1. `left=column`, `right=column`
2. `left=literal`, `right=column`
3. `left=column`, `right=literal`

For the next example, we'll use `col_vals_between()` to check whether the values in column `b` are between the corresponding values in columns `a` (lower bound) and `c` (upper bound).

```python
validation = (
    pb.Validate(data=tbl)
    .col_vals_between(columns="b", left=pb.col("a"), right=pb.col("c"))
    .interrogate()
)

validation
```

The validation table reports two failing test units. The specific failing cases are:

- Row 2: `b` is `1` but the bounds are `2` (`a`) and `8` (`c`).
- Row 4: `b` is `8` but the bounds are `3` (`a`) and `7` (`c`).

col_vals_outside(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', left: 'float | int | Column', right: 'float | int | Column', inclusive: 'tuple[bool, bool]' = (True, True), na_pass: 'bool' = False, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'

Do column data lie outside of two specified values or data in other columns?

The `col_vals_outside()` validation method checks whether column values in a table *do not* fall within a certain range. The range is specified with three arguments: `left=`, `right=`, and `inclusive=`. The `left=` and `right=` values specify the lower and upper bounds. These bounds can be specified as literal values or as column names provided within [`col()`](`pointblank.col`). The validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied).

Parameters
----------
columns
A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column.
left
The lower bound of the range. This can be a single value or a single column name given in [`col()`](`pointblank.col`). The latter option allows for a column-to-column comparison for this bound. See the *What Can Be Used in `left=` and `right=`?* section for details on this.
right The upper bound of the range. This can be a single value or a single column name given in [`col()`](`pointblank.col`). The latter option allows for a column-to-column comparison for this bound. See the *What Can Be Used in `left=` and `right=`?* section for details on this. inclusive A tuple of two boolean values indicating whether the comparison should be inclusive. The position of the boolean values correspond to the `left=` and `right=` values, respectively. By default, both values are `True`. na_pass Should any encountered None, NA, or Null values be considered as passing test units? By default, this is `False`. Set to `True` to pass test units with missing values. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument. segments An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information. thresholds Set threshold failure levels for reporting and reacting to exceedences of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step(s) meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value indicating whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). Returns ------- Validate The `Validate` object with the added validation step. What Can Be Used in `left=` and `right=`? ----------------------------------------- The `left=` and `right=` arguments both allow for a variety of input types. The most common are: - a single numeric value - a single date or datetime value - A [`col()`](`pointblank.col`) object that represents a column in the target table When supplying a number as the basis of comparison, keep in mind that all resolved columns must also be numeric. Should you have columns that are of the date or datetime types, you can supply a date or datetime value within `left=` and `right=`. There is flexibility in how you provide the date or datetime values for the bounds; they can be: - string-based dates or datetimes (e.g., `"2023-10-01"`, `"2023-10-01 13:45:30"`, etc.) - date or datetime objects using the `datetime` module (e.g., `datetime.date(2023, 10, 1)`, `datetime.datetime(2023, 10, 1, 13, 45, 30)`, etc.) 
Finally, when supplying a column name in either `left=` or `right=` (or both), it must be specified within [`col()`](`pointblank.col`). This facilitates column-to-column comparisons and, crucially, the columns being compared to either/both of the bounds must be of the same type as the column data (e.g., all numeric, all dates, etc.). Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Note that you can refer to columns via `columns=` and `left=col(...)`/`right=col(...)` that are expected to be present in the transformed table, but may not exist in the table before preprocessing. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Segmentation ------------ The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column. Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region. Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. Any other values in the column will be disregarded (i.e., no validation steps will be created for them). A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. The following inputs are both valid: ``` # Segments from all unique values in the `region` column # and specific dates in the `date` column segments=["region", ("date", ["2023-01-01", "2023-01-02"])] # Segments from all unique values in the `region` and `date` columns segments=["region", "date"] ``` The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments. Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. 
If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`.

There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion of all test units failing (a value between `0` and `1`) or as the absolute number of failing test units (an integer that's `1` or greater).

Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only

If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all of the threshold levels need to be set; you're free to set any combination of them.

Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter).

Examples
--------
For the examples here, we'll use a simple Polars DataFrame with three numeric columns (`a`, `b`, and `c`). The table is shown below:

```python
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [5, 6, 5, 7, 5, 5],
        "b": [2, 3, 6, 4, 3, 6],
        "c": [9, 8, 8, 9, 9, 7],
    }
)

pb.preview(tbl)
```

Let's validate that values in column `a` are all outside the fixed boundary values of `1` and `4`. We'll determine if this validation had any failing test units (there are six test units, one for each row).

```python
validation = (
    pb.Validate(data=tbl)
    .col_vals_outside(columns="a", left=1, right=4)
    .interrogate()
)

validation
```

Printing the `validation` object shows the validation table in an HTML viewing environment. The validation table shows the single entry that corresponds to the validation step created by using `col_vals_outside()`. All test units passed, and there are no failing test units.

Aside from checking a column against two literal values representing the lower and upper bounds, we can also provide column names to the `left=` and/or `right=` arguments (by using the helper function [`col()`](`pointblank.col`)). In this way, we can perform three additional comparison types:

1. `left=column`, `right=column`
2. `left=literal`, `right=column`
3. `left=column`, `right=literal`

For the next example, we'll use `col_vals_outside()` to check whether the values in column `b` are outside of the range formed by the corresponding values in columns `a` (lower bound) and `c` (upper bound).

```python
validation = (
    pb.Validate(data=tbl)
    .col_vals_outside(columns="b", left=pb.col("a"), right=pb.col("c"))
    .interrogate()
)

validation
```

The validation table reports two failing test units. The specific failing cases are:

- Row 2: `b` is `6` and the bounds are `5` (`a`) and `8` (`c`).
- Row 5: `b` is `6` and the bounds are `5` (`a`) and `7` (`c`).
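One further sketch (not part of the original examples): the `na_pass=` argument controls how missing values are treated by this check. The small table below is hypothetical.

```python
import pointblank as pb
import polars as pl

# Hypothetical data containing a missing value
tbl_na = pl.DataFrame({"a": [8, 9, None, 10]})

# By default a Null value is a failing test unit; with na_pass=True it is
# counted as passing, so all four test units pass here (8, 9, and 10 are
# outside the range [1, 4], and the Null passes by way of na_pass=True).
validation = (
    pb.Validate(data=tbl_na)
    .col_vals_outside(columns="a", left=1, right=4, na_pass=True)
    .interrogate()
)

validation
```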
col_vals_in_set(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', set: 'Collection[Any]', pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate' Validate whether column values are in a set of values. The `col_vals_in_set()` validation method checks whether column values in a table are part of a specified `set=` of values. This validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied). Parameters ---------- columns A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column. set A collection of values to compare against. Can be a list of values, a Python Enum class, or a collection containing Enum instances. When an Enum class is provided, all enum values will be used. When a collection contains Enum instances, their values will be extracted automatically. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument. segments An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information. thresholds Set threshold failure levels for reporting and reacting to exceedences of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step(s) meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value indicating whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). Returns ------- Validate The `Validate` object with the added validation step. Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. 
The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Note that you can refer to a column via `columns=` that is expected to be present in the transformed table, but may not exist in the table before preprocessing. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Segmentation ------------ The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column. Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region. Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. Any other values in the column will be disregarded (i.e., no validation steps will be created for them). A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. The following inputs are both valid: ``` # Segments from all unique values in the `region` column # and specific dates in the `date` column segments=["region", ("date", ["2023-01-01", "2023-01-02"])] # Segments from all unique values in the `region` and `date` columns segments=["region", "date"] ``` The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments. Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion failing of all test units (a value between `0` to `1`), or, the absolute number of failing test units (as integer that's `1` or greater). Thresholds can be defined using one of these input schemes: 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds) 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level 3. 
create a dictionary of 1-3 value entries; the valid keys: are 'warning', 'error', and 'critical' 4. a single integer/float value denoting absolute number or fraction of failing test units for the 'warning' level only If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. All of the threshold levels don't need to be set, you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). Examples -------- For the examples here, we'll use a simple Polars DataFrame with two numeric columns (`a` and `b`). The table is shown below: ```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [5, 2, 4, 6, 2, 5], "b": [5, 8, 2, 6, 5, 1], } ) pb.preview(tbl) ``` Let's validate that values in column `a` are all in the set of `[2, 3, 4, 5, 6]`. We'll determine if this validation had any failing test units (there are six test units, one for each row). ```python validation = ( pb.Validate(data=tbl) .col_vals_in_set(columns="a", set=[2, 3, 4, 5, 6]) .interrogate() ) validation ``` Printing the `validation` object shows the validation table in an HTML viewing environment. The validation table shows the single entry that corresponds to the validation step created by using `col_vals_in_set()`. All test units passed, and there are no failing test units. Now, let's use that same set of values for a validation on column `b`. ```python validation = ( pb.Validate(data=tbl) .col_vals_in_set(columns="b", set=[2, 3, 4, 5, 6]) .interrogate() ) validation ``` The validation table reports two failing test units. The specific failing cases are for the column `b` values of `8` and `1`, which are not in the set of `[2, 3, 4, 5, 6]`. **Using Python Enums** The `col_vals_in_set()` method also supports Python Enum classes and instances, which can make validations more readable and maintainable: ```python from enum import Enum class Color(Enum): RED = "red" GREEN = "green" BLUE = "blue" # Create a table with color data tbl_colors = pl.DataFrame({ "product": ["shirt", "pants", "hat", "shoes"], "color": ["red", "blue", "green", "yellow"] }) # Validate using an Enum class (all enum values are allowed) validation = ( pb.Validate(data=tbl_colors) .col_vals_in_set(columns="color", set=Color) .interrogate() ) validation ``` This validation will fail for the `"yellow"` value since it's not in the `Color` enum. You can also use specific Enum instances or mix them with regular values: ```python # Validate using specific Enum instances validation = ( pb.Validate(data=tbl_colors) .col_vals_in_set(columns="color", set=[Color.RED, Color.BLUE]) .interrogate() ) # Mix Enum instances with regular values validation = ( pb.Validate(data=tbl_colors) .col_vals_in_set(columns="color", set=[Color.RED, Color.BLUE, "yellow"]) .interrogate() ) validation ``` In this case, the `"green"` value will cause a failing test unit since it's not part of the specified set. col_vals_not_in_set(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', set: 'Collection[Any]', pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate' Validate whether column values are not in a set of values. 
The `col_vals_not_in_set()` validation method checks whether column values in a table are *not* part of a specified `set=` of values. This validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied). Parameters ---------- columns A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column. set A collection of values to compare against. Can be a list of values, a Python Enum class, or a collection containing Enum instances. When an Enum class is provided, all enum values will be used. When a collection contains Enum instances, their values will be extracted automatically. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument. segments An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information. thresholds Set threshold failure levels for reporting and reacting to exceedences of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step(s) meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value indicating whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). Returns ------- Validate The `Validate` object with the added validation step. Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Note that you can refer to a column via `columns=` that is expected to be present in the transformed table, but may not exist in the table before preprocessing. 
Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Segmentation ------------ The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column. Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region. Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. Any other values in the column will be disregarded (i.e., no validation steps will be created for them). A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. The following inputs are both valid: ``` # Segments from all unique values in the `region` column # and specific dates in the `date` column segments=["region", ("date", ["2023-01-01", "2023-01-02"])] # Segments from all unique values in the `region` and `date` columns segments=["region", "date"] ``` The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments. Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion failing of all test units (a value between `0` to `1`), or, the absolute number of failing test units (as integer that's `1` or greater). Thresholds can be defined using one of these input schemes: 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds) 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level 3. create a dictionary of 1-3 value entries; the valid keys: are 'warning', 'error', and 'critical' 4. a single integer/float value denoting absolute number or fraction of failing test units for the 'warning' level only If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. All of the threshold levels don't need to be set, you're free to set any combination of them. 
Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). Examples -------- For the examples here, we'll use a simple Polars DataFrame with two numeric columns (`a` and `b`). The table is shown below: ```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [7, 8, 1, 9, 1, 7], "b": [1, 8, 2, 6, 9, 1], } ) pb.preview(tbl) ``` Let's validate that none of the values in column `a` are in the set of `[2, 3, 4, 5, 6]`. We'll determine if this validation had any failing test units (there are six test units, one for each row). ```python validation = ( pb.Validate(data=tbl) .col_vals_not_in_set(columns="a", set=[2, 3, 4, 5, 6]) .interrogate() ) validation ``` Printing the `validation` object shows the validation table in an HTML viewing environment. The validation table shows the single entry that corresponds to the validation step created by using `col_vals_not_in_set()`. All test units passed, and there are no failing test units. Now, let's use that same set of values for a validation on column `b`. ```python validation = ( pb.Validate(data=tbl) .col_vals_not_in_set(columns="b", set=[2, 3, 4, 5, 6]) .interrogate() ) validation ``` The validation table reports two failing test units. The specific failing cases are for the column `b` values of `2` and `6`, both of which are in the set of `[2, 3, 4, 5, 6]`. **Using Python Enums** Like `col_vals_in_set()`, this method also supports Python Enum classes and instances: ```python from enum import Enum class InvalidStatus(Enum): DELETED = "deleted" ARCHIVED = "archived" # Create a table with status data status_table = pl.DataFrame({ "product": ["widget", "gadget", "tool", "device"], "status": ["active", "pending", "deleted", "active"] }) # Validate that no values are in the invalid status set validation = ( pb.Validate(data=status_table) .col_vals_not_in_set(columns="status", set=InvalidStatus) .interrogate() ) validation ``` This `"deleted"` value in the `status` column will fail since it matches one of the invalid statuses in the `InvalidStatus` enum. col_vals_increasing(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', allow_stationary: 'bool' = False, decreasing_tol: 'float | None' = None, na_pass: 'bool' = False, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate' Are column data increasing by row? The `col_vals_increasing()` validation method checks whether column values in a table are increasing when moving down a table. There are options for allowing missing values in the target column, allowing stationary phases (where consecutive values don't change), and even one for allowing decreasing movements up to a certain threshold. This validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied). Parameters ---------- columns A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column. allow_stationary An option to allow pauses in increasing values. 
For example, if the values for the test units are `[80, 82, 82, 85, 88]` then the third unit (`82`, appearing a second time) would be marked as failing when `allow_stationary` is `False`. Using `allow_stationary=True` will result in all the test units in `[80, 82, 82, 85, 88]` being marked as passing.
decreasing_tol
An optional threshold value that allows for movement of numerical values in the negative direction. By default this is `None` but using a numerical value will set the absolute threshold of negative travel allowed across numerical test units. Note that setting a value here also has the effect of setting `allow_stationary` to `True`.
na_pass
Should any encountered None, NA, or Null values be considered as passing test units? By default, this is `False`. Set to `True` to pass test units with missing values.
pre
An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument.
segments
An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information.
thresholds
Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels.
actions
Optional actions to take when the validation step(s) meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions.
brief
An optional brief description of the validation step that will be displayed in the reporting table. You can use templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief.
active
A boolean value indicating whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged).

Returns
-------
Validate
The `Validate` object with the added validation step.

Examples
--------
For the examples here, we'll use a simple Polars DataFrame with three numeric columns (`a`, `b`, and `c`). The table is shown below:

```python
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": [1, 2, 2, 3, 4, 5],
        "c": [1, 2, 1, 3, 4, 5],
    }
)

pb.preview(tbl)
```

Let's validate that values in column `a` are increasing. We'll determine if this validation had any failing test units (there are six test units, one for each row).

```python
validation = (
    pb.Validate(data=tbl)
    .col_vals_increasing(columns="a")
    .interrogate()
)

validation
```

The validation passed as all values in column `a` are increasing.
Now let's check column `b` which has a stationary value: ```python validation = ( pb.Validate(data=tbl) .col_vals_increasing(columns="b") .interrogate() ) validation ``` This validation fails at the third row because the value `2` is repeated. If we want to allow stationary values, we can use `allow_stationary=True`: ```python validation = ( pb.Validate(data=tbl) .col_vals_increasing(columns="b", allow_stationary=True) .interrogate() ) validation ``` col_vals_decreasing(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', allow_stationary: 'bool' = False, increasing_tol: 'float | None' = None, na_pass: 'bool' = False, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate' Are column data decreasing by row? The `col_vals_decreasing()` validation method checks whether column values in a table are decreasing when moving down a table. There are options for allowing missing values in the target column, allowing stationary phases (where consecutive values don't change), and even one for allowing increasing movements up to a certain threshold. This validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied). Parameters ---------- columns A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column. allow_stationary An option to allow pauses in decreasing values. For example, if the values for the test units are `[88, 85, 85, 82, 80]` then the third unit (`85`, appearing a second time) would be marked as failing when `allow_stationary` is `False`. Using `allow_stationary=True` will result in all the test units in `[88, 85, 85, 82, 80]` to be marked as passing. increasing_tol An optional threshold value that allows for movement of numerical values in the positive direction. By default this is `None` but using a numerical value will set the absolute threshold of positive travel allowed across numerical test units. Note that setting a value here also has the effect of setting `allow_stationary` to `True`. na_pass Should any encountered None, NA, or Null values be considered as passing test units? By default, this is `False`. Set to `True` to pass test units with missing values. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument. segments An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information. thresholds Set threshold failure levels for reporting and reacting to exceedences of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. 
The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels.
actions
Optional actions to take when the validation step(s) meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions.
brief
An optional brief description of the validation step that will be displayed in the reporting table. You can use templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief.
active
A boolean value indicating whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged).

Returns
-------
Validate
The `Validate` object with the added validation step.

Examples
--------
For the examples here, we'll use a simple Polars DataFrame with three numeric columns (`a`, `b`, and `c`). The table is shown below:

```python
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [6, 5, 4, 3, 2, 1],
        "b": [5, 4, 4, 3, 2, 1],
        "c": [5, 4, 5, 3, 2, 1],
    }
)

pb.preview(tbl)
```

Let's validate that values in column `a` are decreasing. We'll determine if this validation had any failing test units (there are six test units, one for each row).

```python
validation = (
    pb.Validate(data=tbl)
    .col_vals_decreasing(columns="a")
    .interrogate()
)

validation
```

The validation passed as all values in column `a` are decreasing.

Now let's check column `b`, which has a stationary value:

```python
validation = (
    pb.Validate(data=tbl)
    .col_vals_decreasing(columns="b")
    .interrogate()
)

validation
```

This validation fails at the third row because the value `4` is repeated. If we want to allow stationary values, we can use `allow_stationary=True`:

```python
validation = (
    pb.Validate(data=tbl)
    .col_vals_decreasing(columns="b", allow_stationary=True)
    .interrogate()
)

validation
```

col_vals_null(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'

Validate whether values in a column are Null.

The `col_vals_null()` validation method checks whether column values in a table are Null. This validation will operate over the number of test units that is equal to the number of rows in the table.

Parameters
----------
columns
A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column.
pre
An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument.
segments
An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment).
Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information. thresholds Set threshold failure levels for reporting and reacting to exceedences of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step(s) meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value indicating whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). Returns ------- Validate The `Validate` object with the added validation step. Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Note that you can refer to a column via `columns=` that is expected to be present in the transformed table, but may not exist in the table before preprocessing. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Segmentation ------------ The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column. Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region. Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. Any other values in the column will be disregarded (i.e., no validation steps will be created for them). A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. 
The following inputs are both valid:

```
# Segments from all unique values in the `region` column
# and specific dates in the `date` column
segments=["region", ("date", ["2023-01-01", "2023-01-02"])]

# Segments from all unique values in the `region` and `date` columns
segments=["region", "date"]
```

The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments. Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion failing of all test units (a value between `0` and `1`) or the absolute number of failing test units (as an integer that's `1` or greater). Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. provide a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only

If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all of the threshold levels need to be set; you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). Examples -------- For the examples here, we'll use a simple Polars DataFrame with two numeric columns (`a` and `b`). The table is shown below:

```python
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [None, None, None, None],
        "b": [None, 2, None, 9],
    }
).with_columns(pl.col("a").cast(pl.Int64))

pb.preview(tbl)
```

Let's validate that values in column `a` are all Null values. We'll determine if this validation had any failing test units (there are four test units, one for each row).

```python
validation = (
    pb.Validate(data=tbl)
    .col_vals_null(columns="a")
    .interrogate()
)

validation
```

Printing the `validation` object shows the validation table in an HTML viewing environment. The validation table shows the single entry that corresponds to the validation step created by using `col_vals_null()`. All test units passed, and there are no failing test units. Now, let's use that same set of values for a validation on column `b`.

```python
validation = (
    pb.Validate(data=tbl)
    .col_vals_null(columns="b")
    .interrogate()
)

validation
```

The validation table reports two failing test units.
The specific failing cases are for the two non-Null values in column `b`. col_vals_not_null(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate' Validate whether values in a column are not Null. The `col_vals_not_null()` validation method checks whether column values in a table are not Null. This validation will operate over the number of test units that is equal to the number of rows in the table. Parameters ---------- columns A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument. segments An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information. thresholds Set threshold failure levels for reporting and reacting to exceedences of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step(s) meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value indicating whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). Returns ------- Validate The `Validate` object with the added validation step. Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. 
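As a minimal sketch of a `pre=` transformation (the table and column names here are purely illustrative, and a recent Polars version is assumed for `str.strip_chars()`), a column derived during preprocessing could be validated like this:

```python
import polars as pl
import pointblank as pb

# Hypothetical table: `email_raw` holds untrimmed strings
tbl_raw = pl.DataFrame({"email_raw": [" a@b.com", "c@d.com ", None, "e@f.com"]})

validation = (
    pb.Validate(data=tbl_raw)
    .col_vals_not_null(
        columns="email",  # `email` only exists after preprocessing
        # the callable receives the table and returns a modified table
        pre=lambda df: df.with_columns(pl.col("email_raw").str.strip_chars().alias("email")),
    )
    .interrogate()
)
```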
Note that you can refer to a column via `columns=` that is expected to be present in the transformed table, but may not exist in the table before preprocessing. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Segmentation ------------ The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column. Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region. Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. Any other values in the column will be disregarded (i.e., no validation steps will be created for them). A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. The following inputs are both valid:

```
# Segments from all unique values in the `region` column
# and specific dates in the `date` column
segments=["region", ("date", ["2023-01-01", "2023-01-02"])]

# Segments from all unique values in the `region` and `date` columns
segments=["region", "date"]
```

The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments. Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion failing of all test units (a value between `0` and `1`) or the absolute number of failing test units (as an integer that's `1` or greater). Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. provide a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only

If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all of the threshold levels need to be set; you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). Examples -------- For the examples here, we'll use a simple Polars DataFrame with two numeric columns (`a` and `b`). The table is shown below:

```python
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [4, 7, 2, 8],
        "b": [5, None, 1, None],
    }
)

pb.preview(tbl)
```

Let's validate that none of the values in column `a` are Null values. We'll determine if this validation had any failing test units (there are four test units, one for each row).

```python
validation = (
    pb.Validate(data=tbl)
    .col_vals_not_null(columns="a")
    .interrogate()
)

validation
```

Printing the `validation` object shows the validation table in an HTML viewing environment. The validation table shows the single entry that corresponds to the validation step created by using `col_vals_not_null()`. All test units passed, and there are no failing test units. Now, let's use that same set of values for a validation on column `b`.

```python
validation = (
    pb.Validate(data=tbl)
    .col_vals_not_null(columns="b")
    .interrogate()
)

validation
```

The validation table reports two failing test units. The specific failing cases are for the two Null values in column `b`. col_vals_regex(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', pattern: 'str', na_pass: 'bool' = False, inverse: 'bool' = False, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate' Validate whether column values match a regular expression pattern. The `col_vals_regex()` validation method checks whether column values in a table correspond to a `pattern=` matching expression. This validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied). Parameters ---------- columns A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column. pattern A regular expression pattern to compare against. na_pass Should any encountered None, NA, or Null values be considered as passing test units? By default, this is `False`. Set to `True` to pass test units with missing values. inverse Should the validation step be inverted? If `True`, then the expectation is that column values should *not* match the specified `pattern=` regex. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument. segments An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment).
Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information. thresholds Set threshold failure levels for reporting and reacting to exceedences of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step(s) meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value indicating whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). Returns ------- Validate The `Validate` object with the added validation step. Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Note that you can refer to a column via `columns=` that is expected to be present in the transformed table, but may not exist in the table before preprocessing. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Segmentation ------------ The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column. Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region. Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. Any other values in the column will be disregarded (i.e., no validation steps will be created for them). A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. 
The following inputs are both valid:

```
# Segments from all unique values in the `region` column
# and specific dates in the `date` column
segments=["region", ("date", ["2023-01-01", "2023-01-02"])]

# Segments from all unique values in the `region` and `date` columns
segments=["region", "date"]
```

The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments. Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion failing of all test units (a value between `0` and `1`) or the absolute number of failing test units (as an integer that's `1` or greater). Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. provide a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only

If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all of the threshold levels need to be set; you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). Examples -------- For the examples here, we'll use a simple Polars DataFrame with two string columns (`a` and `b`). The table is shown below:

```python
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": ["rb-0343", "ra-0232", "ry-0954", "rc-1343"],
        "b": ["ra-0628", "ra-583", "rya-0826", "rb-0735"],
    }
)

pb.preview(tbl)
```

Let's validate that all of the values in column `a` match a particular regex pattern. We'll determine if this validation had any failing test units (there are four test units, one for each row).

```python
validation = (
    pb.Validate(data=tbl)
    .col_vals_regex(columns="a", pattern=r"r[a-z]-[0-9]{4}")
    .interrogate()
)

validation
```

Printing the `validation` object shows the validation table in an HTML viewing environment. The validation table shows the single entry that corresponds to the validation step created by using `col_vals_regex()`. All test units passed, and there are no failing test units. Now, let's use the same regex for a validation on column `b`.
```python
validation = (
    pb.Validate(data=tbl)
    .col_vals_regex(columns="b", pattern=r"r[a-z]-[0-9]{4}")
    .interrogate()
)

validation
```

The validation table reports two failing test units. The specific failing cases are for the string values of rows 2 and 3 in column `b`. col_vals_within_spec(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', spec: 'str', na_pass: 'bool' = False, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate' Validate whether column values fit within a specification. The `col_vals_within_spec()` validation method checks whether column values in a table correspond to a specification (`spec=`) type (details of which are available in the *Specifications* section). Specifications include common data types like email addresses, URLs, postal codes, vehicle identification numbers (VINs), International Bank Account Numbers (IBANs), and more. This validation will operate over the number of test units that is equal to the number of rows in the table. Parameters ---------- columns A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column. spec A specification string for defining the specification type. Examples are `"email"`, `"url"`, and `"postal_code[USA]"`. See the *Specifications* section for all available options. na_pass Should any encountered None, NA, or Null values be considered as passing test units? By default, this is `False`. Set to `True` to pass test units with missing values. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument. segments An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information. thresholds Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step(s) meet or exceed any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value indicating whether the validation step should be active.
Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). Returns ------- Validate The `Validate` object with the added validation step. Specifications -------------- A specification type must be used with the `spec=` argument. This is a string-based keyword that corresponds to the type of data in the specified columns. The following keywords can be used: - `"isbn"`: The International Standard Book Number (ISBN) is a unique numerical identifier for books. This keyword validates both 10-digit and 13-digit ISBNs. - `"vin"`: A vehicle identification number (VIN) is a unique code used by the automotive industry to identify individual motor vehicles. - `"postal_code[]"`: A postal code (also known as postcodes, PIN, or ZIP codes) is a series of letters, digits, or both included in a postal address. Because the coding varies by country, a country code in either the 2-letter (ISO 3166-1 alpha-2) or 3-letter (ISO 3166-1 alpha-3) format needs to be supplied (e.g., `"postal_code[US]"` or `"postal_code[USA]"`). The keyword alias `"zip"` can be used for US ZIP codes. - `"credit_card"`: A credit card number can be validated across a variety of issuers. The validation uses the Luhn algorithm. - `"iban[]"`: The International Bank Account Number (IBAN) is a system of identifying bank accounts across countries. Because the length and coding varies by country, a country code needs to be supplied (e.g., `"iban[DE]"` or `"iban[DEU]"`). - `"swift"`: Business Identifier Codes (also known as SWIFT-BIC, BIC, or SWIFT code) are unique identifiers for financial and non-financial institutions. - `"phone"`, `"email"`, `"url"`, `"ipv4"`, `"ipv6"`, `"mac"`: Phone numbers, email addresses, Internet URLs, IPv4 or IPv6 addresses, and MAC addresses can be validated with their respective keywords. Only a single `spec=` value should be provided per function call. Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Note that you can refer to a column via `columns=` that is expected to be present in the transformed table, but may not exist in the table before preprocessing. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Segmentation ------------ The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column. Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region. Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. 
For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. Any other values in the column will be disregarded (i.e., no validation steps will be created for them). A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. The following inputs are both valid:

```
# Segments from all unique values in the `region` column
# and specific dates in the `date` column
segments=["region", ("date", ["2023-01-01", "2023-01-02"])]

# Segments from all unique values in the `region` and `date` columns
segments=["region", "date"]
```

The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments. Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion failing of all test units (a value between `0` and `1`) or the absolute number of failing test units (as an integer that's `1` or greater). Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. provide a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only

If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all of the threshold levels need to be set; you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). Examples -------- For the examples here, we'll use a simple Polars DataFrame with an email column. The table is shown below:

```python
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "email": [
            "user@example.com",
            "admin@test.org",
            "invalid-email",
            "contact@company.co.uk",
        ],
    }
)

pb.preview(tbl)
```

Let's validate that all of the values in the `email` column are valid email addresses. We'll determine if this validation had any failing test units (there are four test units, one for each row).

```python
validation = (
    pb.Validate(data=tbl)
    .col_vals_within_spec(columns="email", spec="email")
    .interrogate()
)

validation
```

The validation table shows that one test unit failed (the invalid email address in row 3).
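Other specification keywords work the same way. As a further sketch (with made-up values), US ZIP codes could be checked with the `"postal_code[US]"` keyword (or its `"zip"` alias):

```python
import pointblank as pb
import polars as pl

# Hypothetical table of ZIP codes; "ABCDE" is deliberately malformed
tbl_zip = pl.DataFrame({"zip": ["10001", "94107", "ABCDE", "60601"]})

validation = (
    pb.Validate(data=tbl_zip)
    .col_vals_within_spec(columns="zip", spec="postal_code[US]")
    .interrogate()
)

validation
```

We'd expect the malformed value in row 3 to be the single failing test unit here.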
col_vals_expr(self, expr: 'any', pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate' Validate column values using a custom expression. The `col_vals_expr()` validation method checks whether column values in a table satisfy a custom `expr=` expression. This validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied). Parameters ---------- expr A column expression that will evaluate each row in the table, returning a boolean value per table row. If the target table is a Polars DataFrame, the expression should either be a Polars column expression or a Narwhals one. For a Pandas DataFrame, the expression should either be a lambda expression or a Narwhals column expression. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument. segments An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information. thresholds Set threshold failure levels for reporting and reacting to exceedences of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value indicating whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). Returns ------- Validate The `Validate` object with the added validation step. Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. 
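For instance, as a hypothetical sketch (assuming a Polars table with a numeric column `a` and a `status` column, neither of which appears in the examples below), rows could be filtered out before the expression is evaluated:

```python
import polars as pl
import pointblank as pb

# Hypothetical table with a numeric column and a status column
tbl_status = pl.DataFrame(
    {
        "a": [1, 2, 3, 4],
        "status": ["active", "active", "closed", "active"],
    }
)

validation = (
    pb.Validate(data=tbl_status)
    .col_vals_expr(
        expr=pl.col("a") > 0,
        # only rows where `status` is "active" are interrogated by this step
        pre=lambda df: df.filter(pl.col("status") == "active"),
    )
    .interrogate()
)
```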
Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Segmentation ------------ The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column. Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region. Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. Any other values in the column will be disregarded (i.e., no validation steps will be created for them). A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. The following inputs are both valid:

```
# Segments from all unique values in the `region` column
# and specific dates in the `date` column
segments=["region", ("date", ["2023-01-01", "2023-01-02"])]

# Segments from all unique values in the `region` and `date` columns
segments=["region", "date"]
```

The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments. Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion failing of all test units (a value between `0` and `1`) or the absolute number of failing test units (as an integer that's `1` or greater). Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. provide a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only

If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all of the threshold levels need to be set; you're free to set any combination of them.
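As a quick sketch of these input schemes (the keyword arguments shown for the `Thresholds` class are assumed to mirror the dictionary keys), the following all set a 'warning' level at 10% of test units failing and an 'error' level at 25%:

```python
import pointblank as pb

thresholds = (0.1, 0.25)                              # tuple scheme: (warning, error[, critical])
thresholds = {"warning": 0.1, "error": 0.25}          # dictionary scheme
thresholds = pb.Thresholds(warning=0.1, error=0.25)   # Thresholds class (assumed keyword names)
```

Any of these can be passed to a step's `thresholds=` argument or to `Validate(thresholds=...)`.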
Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). Examples -------- For the examples here, we'll use a simple Polars DataFrame with three columns (`a`, `b`, and `c`). The table is shown below: ```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [1, 2, 1, 7, 8, 6], "b": [0, 0, 0, 1, 1, 1], "c": [0.5, 0.3, 0.8, 1.4, 1.9, 1.2], } ) pb.preview(tbl) ``` Let's validate that the values in column `a` are all integers. We'll determine if this validation had any failing test units (there are six test units, one for each row). ```python validation = ( pb.Validate(data=tbl) .col_vals_expr(expr=pl.col("a") % 1 == 0) .interrogate() ) validation ``` Printing the `validation` object shows the validation table in an HTML viewing environment. The validation table shows the single entry that corresponds to the validation step created by using `col_vals_expr()`. All test units passed, with no failing test units. rows_distinct(self, columns_subset: 'str | list[str] | None' = None, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate' Validate whether rows in the table are distinct. The `rows_distinct()` method checks whether rows in the table are distinct. This validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied). Parameters ---------- columns_subset A single column or a list of columns to use as a subset for the distinct comparison. If `None`, then all columns in the table will be used for the comparison. If multiple columns are supplied, the distinct comparison will be made over the combination of values in those columns. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument. segments An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information. thresholds Set threshold failure levels for reporting and reacting to exceedences of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. 
active A boolean value indicating whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). Returns ------- Validate The `Validate` object with the added validation step. Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Note that you can refer to columns via `columns_subset=` that are expected to be present in the transformed table, but may not exist in the table before preprocessing. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Segmentation ------------ The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column. Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region. Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. Any other values in the column will be disregarded (i.e., no validation steps will be created for them). A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. The following inputs are both valid: ``` # Segments from all unique values in the `region` column # and specific dates in the `date` column segments=["region", ("date", ["2023-01-01", "2023-01-02"])] # Segments from all unique values in the `region` and `date` columns segments=["region", "date"] ``` The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments. Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. 
There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion failing of all test units (a value between `0` and `1`) or the absolute number of failing test units (as an integer that's `1` or greater). Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. provide a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only

If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all of the threshold levels need to be set; you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). Examples -------- For the examples here, we'll use a simple Polars DataFrame with three string columns (`col_1`, `col_2`, and `col_3`). The table is shown below:

```python
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "col_1": ["a", "b", "c", "d"],
        "col_2": ["a", "a", "c", "d"],
        "col_3": ["a", "a", "d", "e"],
    }
)

pb.preview(tbl)
```

Let's validate that the rows in the table are distinct with `rows_distinct()`. We'll determine if this validation had any failing test units (there are four test units, one for each row). A failing test unit means that a given row is not distinct from every other row.

```python
validation = (
    pb.Validate(data=tbl)
    .rows_distinct()
    .interrogate()
)

validation
```

From this validation table we see that there are no failing test units. All rows in the table are distinct from one another. We can also use a subset of columns to determine distinctness. Let's specify the subset using columns `col_2` and `col_3` for the next validation.

```python
validation = (
    pb.Validate(data=tbl)
    .rows_distinct(columns_subset=["col_2", "col_3"])
    .interrogate()
)

validation
```

The validation table reports two failing test units. The first and second rows are duplicated when considering only the values in columns `col_2` and `col_3`. There's only one set of duplicates but there are two failing test units since each row is compared to all others. rows_complete(self, columns_subset: 'str | list[str] | None' = None, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate' Validate whether row data are complete by having no missing values. The `rows_complete()` method checks whether rows in the table are complete. Completeness of a row means that there are no missing values within the row. This validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied). A subset of columns can be specified for the completeness check. If no subset is provided, all columns in the table will be used. Parameters ---------- columns_subset A single column or a list of columns to use as a subset for the completeness check.
If `None` (the default), then all columns in the table will be used. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument. segments An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). Read the *Segmentation* section for usage information. thresholds Set threshold failure levels for reporting and reacting to exceedences of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value indicating whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). Returns ------- Validate The `Validate` object with the added validation step. Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Note that you can refer to columns via `columns_subset=` that are expected to be present in the transformed table, but may not exist in the table before preprocessing. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Segmentation ------------ The `segments=` argument allows for the segmentation of a validation step into multiple segments. This is useful for applying the same validation step to different subsets of the data. The segmentation can be done based on a single column or specific fields within a column. Providing a single column name will result in a separate validation step for each unique value in that column. For example, if you have a column called `"region"` with values `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each region. Alternatively, you can provide a tuple that specifies a column name and its corresponding values to segment on. 
For example, if you have a column called `"date"` and you want to segment on only specific dates, you can provide a tuple like `("date", ["2023-01-01", "2023-01-02"])`. Any other values in the column will be disregarded (i.e., no validation steps will be created for them). A list with a combination of column names and tuples can be provided as well. This allows for more complex segmentation scenarios. The following inputs are both valid:

```
# Segments from all unique values in the `region` column
# and specific dates in the `date` column
segments=["region", ("date", ["2023-01-01", "2023-01-02"])]

# Segments from all unique values in the `region` and `date` columns
segments=["region", "date"]
```

The segmentation is performed during interrogation, and the resulting validation steps will be numbered sequentially. Each segment will have its own validation step, and the results will be reported separately. This allows for a more granular analysis of the data and helps identify issues within specific segments. Importantly, the segmentation process will be performed after any preprocessing of the data table. Because of this, one can conceivably use the `pre=` argument to generate a column that can be used for segmentation. For example, you could create a new column called `"segment"` through use of `pre=` and then use that column for segmentation. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion failing of all test units (a value between `0` and `1`) or the absolute number of failing test units (as an integer that's `1` or greater). Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level
3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical'
4. provide a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only

If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all of the threshold levels need to be set; you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). Examples -------- For the examples here, we'll use a simple Polars DataFrame with three string columns (`col_1`, `col_2`, and `col_3`). The table is shown below:

```python
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "col_1": ["a", None, "c", "d"],
        "col_2": ["a", "a", "c", None],
        "col_3": ["a", "a", "d", None],
    }
)

pb.preview(tbl)
```

Let's validate that the rows in the table are complete with `rows_complete()`. We'll determine if this validation had any failing test units (there are four test units, one for each row). A failing test unit means that a given row is not complete (i.e., has at least one missing value).
```python
validation = (
    pb.Validate(data=tbl)
    .rows_complete()
    .interrogate()
)

validation
```

From this validation table we see that there are two failing test units. This is because two rows in the table have at least one missing value (the second row and the last row). We can also use a subset of columns to determine completeness. Let's specify the subset using columns `col_2` and `col_3` for the next validation.

```python
validation = (
    pb.Validate(data=tbl)
    .rows_complete(columns_subset=["col_2", "col_3"])
    .interrogate()
)

validation
```

The validation table reports a single failing test unit. The last row contains missing values in both the `col_2` and `col_3` columns. col_exists(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate' Validate whether one or more columns exist in the table. The `col_exists()` method checks whether one or more columns exist in the target table. The only requirement is specification of the column names. Each validation step or expectation will operate over a single test unit, which is whether the column exists or not. Parameters ---------- columns A single column or a list of columns to validate. Can also use [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If multiple columns are supplied or resolved, there will be a separate validation step generated for each column. thresholds Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step(s) meet or exceed any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value indicating whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). Returns ------- Validate The `Validate` object with the added validation step. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion failing of all test units (a value between `0` and `1`) or the absolute number of failing test units (as an integer that's `1` or greater). Thresholds can be defined using one of these input schemes:

1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds)
provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level 3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical' 4. a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all threshold levels need to be set; you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). Examples -------- For the examples here, we'll use a simple Polars DataFrame with a string column (`a`) and a numeric column (`b`). The table is shown below:

```python
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": ["apple", "banana", "cherry", "date"],
        "b": [1, 6, 3, 5],
    }
)

pb.preview(tbl)
```

Let's validate that the columns `a` and `b` actually exist in the table. We'll determine if this validation had any failing test units (each validation will have a single test unit).

```python
validation = (
    pb.Validate(data=tbl)
    .col_exists(columns=["a", "b"])
    .interrogate()
)

validation
```

Printing the `validation` object shows the validation table in an HTML viewing environment. The validation table shows two entries (one check per column) generated by the `col_exists()` validation step. Both steps passed since both columns provided in `columns=` are present in the table. Now, let's check for the existence of a different set of columns.

```python
validation = (
    pb.Validate(data=tbl)
    .col_exists(columns=["b", "c"])
    .interrogate()
)

validation
```

The validation table reports one passing validation step (the check for column `b`) and one failing validation step (the check for column `c`, which doesn't exist). col_schema_match(self, schema: 'Schema', complete: 'bool' = True, in_order: 'bool' = True, case_sensitive_colnames: 'bool' = True, case_sensitive_dtypes: 'bool' = True, full_match_dtypes: 'bool' = True, pre: 'Callable | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate' Do columns in the table (and their types) match a predefined schema? The `col_schema_match()` method works in conjunction with an object generated by the [`Schema`](`pointblank.Schema`) class. That class object is the expectation for the actual schema of the target table. The validation step operates over a single test unit, which is whether the schema matches that of the table (within the constraints enforced by the `complete=` and `in_order=` options). Parameters ---------- schema A `Schema` object that represents the expected schema of the table. This object is generated by the [`Schema`](`pointblank.Schema`) class. complete Should the schema match be complete? If `True`, then the target table must have all columns specified in the schema. If `False`, then the table can have additional columns not in the schema (i.e., the schema is a subset of the target table's columns). in_order Should the schema match be in order? If `True`, then the columns in the schema must appear in the same order as they do in the target table. If `False`, then the order of columns in the schema and the target table can differ.
case_sensitive_colnames Should the schema match be case-sensitive with regard to column names? If `True`, then the column names in the schema and the target table must match exactly. If `False`, then the column names are compared in a case-insensitive manner. case_sensitive_dtypes Should the schema match be case-sensitive with regard to column data types? If `True`, then the column data types in the schema and the target table must match exactly. If `False`, then the column data types are compared in a case-insensitive manner. full_match_dtypes Should the schema match require a full match of data types? If `True`, then the column data types in the schema and the target table must match exactly. If `False` then substring matches are allowed, so a schema data type of `Int` would match a target table data type of `Int64`. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument. thresholds Set threshold failure levels for reporting and reacting to exceedences of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value indicating whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). Returns ------- Validate The `Validate` object with the added validation step. Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion failing of all test units (a value between `0` to `1`), or, the absolute number of failing test units (as integer that's `1` or greater). 
Thresholds can be defined using one of these input schemes: 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds) 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level 3. create a dictionary of 1-3 value entries; the valid keys: are 'warning', 'error', and 'critical' 4. a single integer/float value denoting absolute number or fraction of failing test units for the 'warning' level only If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. All of the threshold levels don't need to be set, you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). Examples -------- For the examples here, we'll use a simple Polars DataFrame with three columns (string, integer, and float). The table is shown below: ```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": ["apple", "banana", "cherry", "date"], "b": [1, 6, 3, 5], "c": [1.1, 2.2, 3.3, 4.4], } ) pb.preview(tbl) ``` Let's validate that the columns in the table match a predefined schema. A schema can be defined using the [`Schema`](`pointblank.Schema`) class. ```python schema = pb.Schema( columns=[("a", "String"), ("b", "Int64"), ("c", "Float64")] ) ``` You can print the schema object to verify that the expected schema is as intended. ```python print(schema) ``` Now, we'll use the `col_schema_match()` method to validate the table against the expected `schema` object. There is a single test unit for this validation step (whether the schema matches the table or not). ```python validation = ( pb.Validate(data=tbl) .col_schema_match(schema=schema) .interrogate() ) validation ``` The validation table shows that the schema matches the table. The single test unit passed since the table columns and their types match the schema. row_count_match(self, count: 'int | FrameT | Any', tol: 'Tolerance' = 0, inverse: 'bool' = False, pre: 'Callable | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate' Validate whether the row count of the table matches a specified count. The `row_count_match()` method checks whether the row count of the target table matches a specified count. This validation will operate over a single test unit, which is whether the row count matches the specified count. We also have the option to invert the validation step by setting `inverse=True`. This will make the expectation that the row count of the target table *does not* match the specified count. Parameters ---------- count The expected row count of the table. This can be an integer value, a Polars or Pandas DataFrame object, or an Ibis backend table. If a DataFrame/table is provided, the row count of that object will be used as the expected count. tol The tolerance allowable for the row count match. This can be specified as a single numeric value (integer or float) or as a tuple of two integers representing the lower and upper bounds of the tolerance range. If a single integer value (greater than 1) is provided, it represents the absolute bounds of the tolerance, ie. plus or minus the value. 
If a float value (between 0-1) is provided, it represents the relative tolerance, ie. plus or minus the relative percentage of the target. If a tuple is provided, it represents the lower and upper absolute bounds of the tolerance range. See the examples for more. inverse Should the validation step be inverted? If `True`, then the expectation is that the row count of the target table should not match the specified `count=` value. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument. thresholds Set threshold failure levels for reporting and reacting to exceedences of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value indicating whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). Returns ------- Validate The `Validate` object with the added validation step. Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion failing of all test units (a value between `0` to `1`), or, the absolute number of failing test units (as integer that's `1` or greater). Thresholds can be defined using one of these input schemes: 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds) 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level 3. 
create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and 'critical' 4. a single integer/float value denoting the absolute number or fraction of failing test units for the 'warning' level only If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. Not all threshold levels need to be set; you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). Examples -------- For the examples here, we'll use the built-in dataset `"small_table"`. The table can be obtained by calling `load_dataset("small_table")`. Let's validate that the number of rows in the table matches a fixed value. In this case, we will use the value `13` as the expected row count.

```python
import pointblank as pb

small_table = pb.load_dataset("small_table")

validation = (
    pb.Validate(data=small_table)
    .row_count_match(count=13)
    .interrogate()
)

validation
```

The validation table shows that the expectation value of `13` matches the actual count of rows in the target table. So, the single test unit passed. Let's modify our example to show the different ways we can allow some tolerance in our validation by using the `tol=` argument.

```python
smaller_small_table = small_table.sample(n=12)  # within the lower bound

validation = (
    pb.Validate(data=smaller_small_table)
    .row_count_match(count=13, tol=(2, 0))  # minus 2 but plus 0, i.e., 11-13
    .interrogate()
)

validation

validation = (
    pb.Validate(data=smaller_small_table)
    .row_count_match(count=13, tol=0.05)  # 5% relative tolerance of 13
    .interrogate()
)

even_smaller_table = small_table.sample(n=2)

validation = (
    pb.Validate(data=even_smaller_table)
    .row_count_match(count=13, tol=5)  # plus or minus 5; this test will fail
    .interrogate()
)

validation
```

col_count_match(self, count: 'int | FrameT | Any', inverse: 'bool' = False, pre: 'Callable | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate' Validate whether the column count of the table matches a specified count. The `col_count_match()` method checks whether the column count of the target table matches a specified count. This validation will operate over a single test unit, which is whether the column count matches the specified count. We also have the option to invert the validation step by setting `inverse=True`. This will make the expectation that the column count of the target table *does not* match the specified count. Parameters ---------- count The expected column count of the table. This can be an integer value, a Polars or Pandas DataFrame object, or an Ibis backend table. If a DataFrame/table is provided, the column count of that object will be used as the expected count. inverse Should the validation step be inverted? If `True`, then the expectation is that the column count of the target table should not match the specified `count=` value. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument. thresholds Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`.
The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value indicating whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). Returns ------- Validate The `Validate` object with the added validation step. Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion failing of all test units (a value between `0` to `1`), or, the absolute number of failing test units (as integer that's `1` or greater). Thresholds can be defined using one of these input schemes: 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds) 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level 3. create a dictionary of 1-3 value entries; the valid keys: are 'warning', 'error', and 'critical' 4. a single integer/float value denoting absolute number or fraction of failing test units for the 'warning' level only If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. All of the threshold levels don't need to be set, you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). Examples -------- For the examples here, we'll use the built in dataset `"game_revenue"`. The table can be obtained by calling `load_dataset("game_revenue")`. Let's validate that the number of columns in the table matches a fixed value. In this case, we will use the value `11` as the expected column count. 
```python
import pointblank as pb

game_revenue = pb.load_dataset("game_revenue")

validation = (
    pb.Validate(data=game_revenue)
    .col_count_match(count=11)
    .interrogate()
)

validation
```

The validation table shows that the expectation value of `11` matches the actual count of columns in the target table. So, the single test unit passed. tbl_match(self, tbl_compare: 'FrameT | Any', pre: 'Callable | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate' Validate whether the target table matches a comparison table. The `tbl_match()` method checks whether the target table's composition matches that of a comparison table. The validation performs a comprehensive comparison using progressively stricter checks (from least to most stringent): 1. **Column count match**: both tables must have the same number of columns 2. **Row count match**: both tables must have the same number of rows 3. **Schema match (loose)**: column names and dtypes match (case-insensitive, any order) 4. **Schema match (order)**: columns in the correct order (case-insensitive names) 5. **Schema match (exact)**: column names match exactly (case-sensitive, correct order) 6. **Data match**: values in corresponding cells must be identical This progressive approach helps identify exactly where tables differ. The validation will fail at the first check that doesn't pass, making it easier to diagnose mismatches. This validation operates over a single test unit (pass/fail for complete table match). Parameters ---------- tbl_compare The comparison table to validate against. This can be a DataFrame object (Polars or Pandas), an Ibis table object, or a callable that returns a table. If a callable is provided, it will be executed during interrogation to obtain the comparison table. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument. thresholds Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value indicating whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). Returns ------- Validate The `Validate` object with the added validation step. Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation.
This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Note that the same preprocessing is **not** applied to the comparison table; only the target table is preprocessed. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion failing of all test units (a value between `0` to `1`), or, the absolute number of failing test units (as integer that's `1` or greater). Thresholds can be defined using one of these input schemes: 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds) 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level 3. create a dictionary of 1-3 value entries; the valid keys: are 'warning', 'error', and 'critical' 4. a single integer/float value denoting absolute number or fraction of failing test units for the 'warning' level only If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. All of the threshold levels don't need to be set, you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). Cross-Backend Validation ------------------------ The `tbl_match()` method supports **automatic backend coercion** when comparing tables from different backends (e.g., comparing a Polars DataFrame against a Pandas DataFrame, or comparing database tables from DuckDB/SQLite against in-memory DataFrames). When tables with different backends are detected, the comparison table is automatically converted to match the data table's backend before validation proceeds. **Certified Backend Combinations:** All combinations of the following backends have been tested and certified to work (in both directions): - Pandas DataFrame - Polars DataFrame - DuckDB (native) - DuckDB (as Ibis table) - SQLite (via Ibis) Note that database backends (DuckDB, SQLite, PostgreSQL, MySQL, Snowflake, BigQuery) are automatically materialized during validation: - if comparing **against Polars**: materialized to Polars - if comparing **against Pandas**: materialized to Pandas - if **both tables are database backends**: both materialized to Polars This ensures optimal performance and type consistency. 
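As a minimal sketch of a cross-backend comparison (assuming both Polars and Pandas are installed; the table contents here are illustrative), the target table can be a Polars DataFrame while the comparison table is a Pandas DataFrame:

```python
import pandas as pd
import polars as pl
import pointblank as pb

# Target table in Polars; comparison table in Pandas with the same contents
tbl_pl = pl.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
tbl_pd = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# The comparison table is coerced to the target table's backend before the checks run
validation = (
    pb.Validate(data=tbl_pl)
    .tbl_match(tbl_compare=tbl_pd)
    .interrogate()
)

validation
```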
**Data Types That Work Best in Cross-Backend Validation:** - numeric types: int, float columns (including proper NaN handling) - string types: text columns with consistent encodings - boolean types: True/False values - null values: `None` and `NaN` are treated as equivalent across backends - list columns: nested list structures (with basic types) **Known Limitations:** While many data types work well in cross-backend validation, there are some known limitations to be aware of: - date/datetime types: When converting between Polars and Pandas, date objects may be represented differently. For example, `datetime.date` objects in Pandas may become `pd.Timestamp` objects when converted from Polars, leading to false mismatches. To work around this, ensure both tables use the same datetime representation before comparison. - custom types: User-defined types or complex nested structures may not convert cleanly between backends and could cause unexpected comparison failures. - categorical types: Categorical/factor columns may have different internal representations across backends. - timezone-aware datetimes: Timezone handling differs between backends and may cause comparison issues. Here are some ideas to overcome such limitations: - for date/datetime columns, consider using `pre=` preprocessing to normalize representations before comparison. - when working with custom types, manually convert tables to the same backend before using `tbl_match()`. - use the same datetime precision (e.g., milliseconds vs microseconds) in both tables. Examples -------- For the examples here, we'll create two simple tables to demonstrate the `tbl_match()` validation. ```python import pointblank as pb import polars as pl # Create the first table tbl_1 = pl.DataFrame({ "a": [1, 2, 3, 4], "b": ["w", "x", "y", "z"], "c": [4.0, 5.0, 6.0, 7.0] }) # Create an identical table tbl_2 = pl.DataFrame({ "a": [1, 2, 3, 4], "b": ["w", "x", "y", "z"], "c": [4.0, 5.0, 6.0, 7.0] }) pb.preview(tbl_1) ``` Let's validate that `tbl_1` matches `tbl_2`. Since these tables are identical, the validation should pass. ```python validation = ( pb.Validate(data=tbl_1) .tbl_match(tbl_compare=tbl_2) .interrogate() ) validation ``` The validation table shows that the single test unit passed, indicating that the two tables match completely. Now, let's create a table with a slight difference and see what happens. ```python # Create a table with one different value tbl_3 = pl.DataFrame({ "a": [1, 2, 3, 4], "b": ["w", "x", "y", "z"], "c": [4.0, 5.5, 6.0, 7.0] # Changed 5.0 to 5.5 }) validation = ( pb.Validate(data=tbl_1) .tbl_match(tbl_compare=tbl_3) .interrogate() ) validation ``` The validation table shows that the single test unit failed because the tables don't match (one value is different in column `c`). conjointly(self, *exprs: 'Callable', pre: 'Callable | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate' Perform multiple row-wise validations for joint validity. The `conjointly()` validation method checks whether each row in the table passes multiple validation conditions simultaneously. This enables compound validation logic where a test unit (typically a row) must satisfy all specified conditions to pass the validation. This method accepts multiple validation expressions as callables, which should return boolean expressions when applied to the data. 
You can use lambdas that incorporate Polars/Pandas/Ibis expressions (based on the target table type) or create more complex validation functions. The validation will operate over the number of test units that is equal to the number of rows in the table (determined after any `pre=` mutation has been applied). Parameters ---------- *exprs Multiple validation expressions provided as callable functions. Each callable should accept a table as its single argument and return a boolean expression or Series/Column that evaluates to boolean values for each row. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument. thresholds Set threshold failure levels for reporting and reacting to exceedences of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value indicating whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). Returns ------- Validate The `Validate` object with the added validation step. Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion failing of all test units (a value between `0` to `1`), or, the absolute number of failing test units (as integer that's `1` or greater). Thresholds can be defined using one of these input schemes: 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds) 2. 
provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level 3. create a dictionary of 1-3 value entries; the valid keys: are 'warning', 'error', and 'critical' 4. a single integer/float value denoting absolute number or fraction of failing test units for the 'warning' level only If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. All of the threshold levels don't need to be set, you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). Examples -------- For the examples here, we'll use a simple Polars DataFrame with three numeric columns (`a`, `b`, and `c`). The table is shown below: ```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [5, 7, 1, 3, 9, 4], "b": [6, 3, 0, 5, 8, 2], "c": [10, 4, 8, 9, 10, 5], } ) pb.preview(tbl) ``` Let's validate that the values in each row satisfy multiple conditions simultaneously: 1. Column `a` should be greater than 2 2. Column `b` should be less than 7 3. The sum of `a` and `b` should be less than the value in column `c` We'll use `conjointly()` to check all these conditions together: ```python validation = ( pb.Validate(data=tbl) .conjointly( lambda df: pl.col("a") > 2, lambda df: pl.col("b") < 7, lambda df: pl.col("a") + pl.col("b") < pl.col("c") ) .interrogate() ) validation ``` The validation table shows that not all rows satisfy all three conditions together. For a row to pass the conjoint validation, all three conditions must be true for that row. We can also use preprocessing to filter the data before applying the conjoint validation: ```python # Define preprocessing function for serialization compatibility def filter_by_c_gt_5(df): return df.filter(pl.col("c") > 5) validation = ( pb.Validate(data=tbl) .conjointly( lambda df: pl.col("a") > 2, lambda df: pl.col("b") < 7, lambda df: pl.col("a") + pl.col("b") < pl.col("c"), pre=filter_by_c_gt_5 ) .interrogate() ) validation ``` This allows for more complex validation scenarios where the data is first prepared and then validated against multiple conditions simultaneously. Or, you can use the backend-agnostic column expression helper [`expr_col()`](`pointblank.expr_col`) to write expressions that work across different table backends: ```python tbl = pl.DataFrame( { "a": [5, 7, 1, 3, 9, 4], "b": [6, 3, 0, 5, 8, 2], "c": [10, 4, 8, 9, 10, 5], } ) # Using backend-agnostic syntax with expr_col() validation = ( pb.Validate(data=tbl) .conjointly( lambda df: pb.expr_col("a") > 2, lambda df: pb.expr_col("b") < 7, lambda df: pb.expr_col("a") + pb.expr_col("b") < pb.expr_col("c") ) .interrogate() ) validation ``` Using [`expr_col()`](`pointblank.expr_col`) allows your validation code to work consistently across Pandas, Polars, and Ibis table backends without changes, making your validation pipelines more portable. See Also -------- Look at the documentation of the [`expr_col()`](`pointblank.expr_col`) function for more information on how to use it with different table backends. specially(self, expr: 'Callable', pre: 'Callable | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate' Perform a specialized validation with customized logic. 
The `specially()` validation method allows for the creation of specialized validation expressions that can be used to validate specific conditions or logic in the data. This method provides maximum flexibility by accepting a custom callable that encapsulates your validation logic. The callable function can have one of two signatures: - a function accepting a single parameter (the data table): `def validate(data): ...` - a function with no parameters: `def validate(): ...` The second form is particularly useful for environment validations that don't need to inspect the data table. The callable function must ultimately return one of: 1. a single boolean value or boolean list 2. a table where the final column contains boolean values (column name is unimportant) The validation will operate over the number of test units that is equal to the number of rows in the data table (if returning a table with boolean values). If returning a scalar boolean value, the validation will operate over a single test unit. For a return of a list of boolean values, the length of the list constitutes the number of test units. Parameters ---------- expr A callable function that defines the specialized validation logic. This function should: (1) accept the target data table as its single argument (though it may ignore it), or (2) take no parameters at all (for environment validations). The function must ultimately return boolean values representing validation results. Design your function to incorporate any custom parameters directly within the function itself using closure variables or default parameters. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. Have a look at the *Preprocessing* section for more information on how to use this argument. thresholds Set threshold failure levels for reporting and reacting to exceedences of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. Look at the *Thresholds* section for information on how to set threshold levels. actions Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value indicating whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). Returns ------- Validate The `Validate` object with the added validation step. Preprocessing ------------- The `pre=` argument allows for a preprocessing function or lambda to be applied to the data table during interrogation. This function should take a table as input and return a modified table. This is useful for performing any necessary transformations or filtering on the data before the validation step is applied. 
The preprocessing function can be any callable that takes a table as input and returns a modified table. For example, you could use a lambda function to filter the table based on certain criteria or to apply a transformation to the data. Regarding the lifetime of the transformed table, it only exists during the validation step and is not stored in the `Validate` object or used in subsequent validation steps. Thresholds ---------- The `thresholds=` parameter is used to set the failure-condition levels for the validation step. If they are set here at the step level, these thresholds will override any thresholds set at the global level in `Validate(thresholds=...)`. There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values can either be set as a proportion failing of all test units (a value between `0` to `1`), or, the absolute number of failing test units (as integer that's `1` or greater). Thresholds can be defined using one of these input schemes: 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create thresholds) 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is the 'error' level, and position `2` is the 'critical' level 3. create a dictionary of 1-3 value entries; the valid keys: are 'warning', 'error', and 'critical' 4. a single integer/float value denoting absolute number or fraction of failing test units for the 'warning' level only If the number of failing test units exceeds set thresholds, the validation step will be marked as 'warning', 'error', or 'critical'. All of the threshold levels don't need to be set, you're free to set any combination of them. Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the `actions=` parameter). Examples -------- The `specially()` method offers maximum flexibility for validation, allowing you to create custom validation logic that fits your specific needs. The following examples demonstrate different patterns and use cases for this powerful validation approach. ### Simple validation with direct table access This example shows the most straightforward use case where we create a function that directly checks if the sum of two columns is positive. ```python import pointblank as pb import polars as pl simple_tbl = pl.DataFrame({ "a": [5, 7, 1, 3, 9, 4], "b": [6, 3, 0, 5, 8, 2] }) # Simple function that validates directly on the table def validate_sum_positive(data): return data.select(pl.col("a") + pl.col("b") > 0) ( pb.Validate(data=simple_tbl) .specially(expr=validate_sum_positive) .interrogate() ) ``` The function returns a Polars DataFrame with a single boolean column indicating whether the sum of columns `a` and `b` is positive for each row. Each row in the resulting DataFrame is a distinct test unit. This pattern works well for simple validations where you don't need configurable parameters. 
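As a small extension of the example above (a sketch that reuses `simple_tbl` and `validate_sum_positive` from the previous code block; the filter function name is hypothetical), the `pre=` argument can narrow the table before the custom check runs:

```python
# Drop rows where `b` is 0 before the custom check runs; the number of test
# units then reflects the filtered table, not the original one
def drop_zero_b(data):
    return data.filter(pl.col("b") > 0)

(
    pb.Validate(data=simple_tbl)
    .specially(
        expr=validate_sum_positive,  # defined in the previous example
        pre=drop_zero_b,
    )
    .interrogate()
)
```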
### Advanced validation with closure variables for parameters When you need to make your validation configurable, you can use the function factory pattern (also known as closures) to create parameterized validations: ```python # Create a parameterized validation function using closures def make_column_ratio_validator(col1, col2, min_ratio): def validate_column_ratio(data): return data.select((pl.col(col1) / pl.col(col2)) > min_ratio) return validate_column_ratio ( pb.Validate(data=simple_tbl) .specially( expr=make_column_ratio_validator(col1="a", col2="b", min_ratio=0.5) ) .interrogate() ) ``` This approach allows you to create reusable validation functions that can be configured with different parameters without modifying the function itself. ### Validation function returning a list of booleans This example demonstrates how to create a validation function that returns a list of boolean values, where each element represents a separate test unit: ```python import pointblank as pb import polars as pl import random # Create sample data transaction_tbl = pl.DataFrame({ "transaction_id": [f"TX{i:04d}" for i in range(1, 11)], "amount": [120.50, 85.25, 50.00, 240.75, 35.20, 150.00, 85.25, 65.00, 210.75, 90.50], "category": ["food", "shopping", "entertainment", "travel", "utilities", "food", "shopping", "entertainment", "travel", "utilities"] }) # Define a validation function that returns a list of booleans def validate_transaction_rules(data): # Create a list to store individual test results test_results = [] # Check each row individually against multiple business rules for row in data.iter_rows(named=True): # Rule: transaction IDs must start with "TX" and be 6 chars long valid_id = row["transaction_id"].startswith("TX") and len(row["transaction_id"]) == 6 # Rule: Amounts must be appropriate for their category valid_amount = True if row["category"] == "food" and (row["amount"] < 10 or row["amount"] > 200): valid_amount = False elif row["category"] == "utilities" and (row["amount"] < 20 or row["amount"] > 300): valid_amount = False elif row["category"] == "entertainment" and row["amount"] > 100: valid_amount = False # A transaction passes if it satisfies both rules test_results.append(valid_id and valid_amount) return test_results ( pb.Validate(data=transaction_tbl) .specially( expr=validate_transaction_rules, brief="Validate transaction IDs and amounts by category." ) .interrogate() ) ``` This example shows how to create a validation function that applies multiple business rules to each row and returns a list of boolean results. Each boolean in the list represents a separate test unit, and a test unit passes only if all rules are satisfied for a given row. The function iterates through each row in the data table, checking: 1. if transaction IDs follow the required format 2. if transaction amounts are appropriate for their respective categories This approach is powerful when you need to apply complex, conditional logic that can't be easily expressed using the built-in validation functions. ### Table-level validation returning a single boolean Sometimes you need to validate properties of the entire table rather than row-by-row. 
In these cases, your function can return a single boolean value: ```python def validate_table_properties(data): # Check if table has at least one row with column 'a' > 10 has_large_values = data.filter(pl.col("a") > 10).height > 0 # Check if mean of column 'b' is positive has_positive_mean = data.select(pl.mean("b")).item() > 0 # Return a single boolean for the entire table return has_large_values and has_positive_mean ( pb.Validate(data=simple_tbl) .specially(expr=validate_table_properties) .interrogate() ) ``` This example demonstrates how to perform multiple checks on the table as a whole and combine them into a single validation result. ### Environment validation that doesn't use the data table The `specially()` validation method can even be used to validate aspects of your environment that are completely independent of the data: ```python def validate_pointblank_version(): try: import importlib.metadata version = importlib.metadata.version("pointblank") version_parts = version.split(".") # Get major and minor components regardless of how many parts there are major = int(version_parts[0]) minor = int(version_parts[1]) # Check both major and minor components for version `0.9+` return (major > 0) or (major == 0 and minor >= 9) except Exception as e: # More specific error handling could be added here print(f"Version check failed: {e}") return False ( pb.Validate(data=simple_tbl) .specially( expr=validate_pointblank_version, brief="Check Pointblank version `>=0.9.0`." ) .interrogate() ) ``` This pattern shows how to validate external dependencies or environment conditions as part of your validation workflow. Notice that the function doesn't take any parameters at all, which makes it cleaner when the validation doesn't need to access the data table. By combining these patterns, you can create sophisticated validation workflows that address virtually any data quality requirement in your organization. prompt(self, prompt: 'str', model: 'str', columns_subset: 'str | list[str] | None' = None, batch_size: 'int' = 1000, max_concurrent: 'int' = 3, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate' Validate rows using AI/LLM-powered analysis. The `prompt()` validation method uses Large Language Models (LLMs) to validate rows of data based on natural language criteria. Similar to other Pointblank validation methods, this generates binary test results (pass/fail) that integrate seamlessly with the standard reporting framework. Like `col_vals_*()` methods, `prompt()` evaluates data against specific criteria, but instead of using programmatic rules, it uses natural language prompts interpreted by an LLM. Like `rows_distinct()` and `rows_complete()`, it operates at the row level and allows you to specify a subset of columns for evaluation using `columns_subset=`. The system automatically combines your validation criteria from the `prompt=` parameter with the necessary technical context, data formatting instructions, and response structure requirements. This is all so you only need to focus on describing your validation logic in plain language. Each row becomes a test unit that either passes or fails the validation criteria, producing the familiar True/False results that appear in Pointblank validation reports. 
This method is particularly useful for complex validation rules that are difficult to express with traditional validation methods, such as semantic checks, context-dependent validation, or subjective quality assessments. Parameters ---------- prompt A natural language description of the validation criteria. This prompt should clearly describe what constitutes valid vs. invalid rows. Some examples: `"Each row should contain a valid email address and a realistic person name"`, `"Values should indicate positive sentiment"`, `"The description should mention a country name"`. columns_subset A single column or list of columns to include in the validation. If `None`, all columns will be included. Specifying fewer columns can improve performance and reduce API costs, so try to include only the columns necessary for the validation. model The model to be used. This should be in the form of `provider:model` (e.g., `"anthropic:claude-sonnet-4-5"`). Supported providers are `"anthropic"`, `"openai"`, `"ollama"`, and `"bedrock"`. The model name should be the specific model to be used from the provider. Model names are subject to change so consult the provider's documentation for the most up-to-date model names. batch_size Number of rows to process in each batch. Larger batches are more efficient but may hit API limits. Default is `1000`. max_concurrent Maximum number of concurrent API requests. Higher values speed up processing but may hit rate limits. Default is `3`. pre An optional preprocessing function or lambda to apply to the data table during interrogation. This function should take a table as input and return a modified table. segments An optional directive on segmentation, which serves to split a validation step into multiple (one step per segment). Can be a single column name, a tuple that specifies a column name and its corresponding values to segment on, or a combination of both (provided as a list). thresholds Set threshold failure levels for reporting and reacting to exceedances of the levels. The thresholds are set at the step level and will override any global thresholds set in `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will be set locally and global thresholds (if any) will take effect. actions Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to define the actions. brief An optional brief description of the validation step that will be displayed in the reporting table. You can use the templating elements like `"{step}"` to insert the step number, or `"{auto}"` to include an automatically generated brief. If `True` the entire brief will be automatically generated. If `None` (the default) then there won't be a brief. active A boolean value indicating whether the validation step should be active. Using `False` will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged). Returns ------- Validate The `Validate` object with the added validation step. Constructing the `model` Argument --------------------------------- The `model=` argument should be constructed using the provider and model name separated by a colon (`provider:model`). The provider text can be any of: - `"anthropic"` (Anthropic) - `"openai"` (OpenAI) - `"ollama"` (Ollama) - `"bedrock"` (Amazon Bedrock) The model name should be the specific model to be used from the provider.
Model names are subject to change so consult the provider's documentation for the most up-to-date model names. Notes on Authentication ----------------------- API keys are automatically loaded from environment variables or `.env` files and are **not** stored in the validation object for security reasons. You should consider using a secure method for handling API keys. One way to do this is to load the API key from an environment variable and retrieve it using the `os` module (specifically the `os.getenv()` function). Places to store the API key might include `.bashrc`, `.bash_profile`, `.zshrc`, or `.zsh_profile`. Another solution is to store one or more model provider API keys in an `.env` file (in the root of your project). If the API keys have correct names (e.g., `ANTHROPIC_API_KEY` or `OPENAI_API_KEY`) then the AI validation will automatically load the API key from the `.env` file. An `.env` file might look like this: ```plaintext ANTHROPIC_API_KEY="your_anthropic_api_key_here" OPENAI_API_KEY="your_openai_api_key_here" ``` There's no need to have the `python-dotenv` package installed when using `.env` files in this way. **Provider-specific setup**: - **OpenAI**: set `OPENAI_API_KEY` environment variable or create `.env` file - **Anthropic**: set `ANTHROPIC_API_KEY` environment variable or create `.env` file - **Ollama**: no API key required, just ensure Ollama is running locally - **Bedrock**: configure AWS credentials through standard AWS methods AI Validation Process --------------------- The AI validation process works as follows: 1. data batching: the data is split into batches of the specified size 2. row deduplication: duplicate rows (based on selected columns) are identified and only unique combinations are sent to the LLM for analysis 3. json conversion: each batch of unique rows is converted to JSON format for the LLM 4. prompt construction: the user prompt is embedded in a structured system prompt 5. llm processing: each batch is sent to the LLM for analysis 6. response parsing: LLM responses are parsed to extract validation results 7. result projection: results are mapped back to all original rows using row signatures 8. result aggregation: results from all batches are combined **Performance Optimization**: the process uses row signature memoization to avoid redundant LLM calls. When multiple rows have identical values in the selected columns, only one representative row is validated, and the result is applied to all matching rows. This can dramatically reduce API costs and processing time for datasets with repetitive patterns. The LLM receives data in this JSON format: ```json { "columns": ["col1", "col2", "col3"], "rows": [ {"col1": "value1", "col2": "value2", "col3": "value3", "_pb_row_index": 0}, {"col1": "value4", "col2": "value5", "col3": "value6", "_pb_row_index": 1} ] } ``` The LLM returns validation results in this format: ```json [ {"index": 0, "result": true}, {"index": 1, "result": false} ] ``` Prompt Design Tips ------------------ For best results, design prompts that are: - boolean-oriented: frame validation criteria to elicit clear valid/invalid responses - specific: clearly define what makes a row valid/invalid - unambiguous: avoid subjective language that could be interpreted differently - context-aware: include relevant business rules or domain knowledge - example-driven: consider providing examples in the prompt when helpful **Critical**: Prompts must be designed so the LLM can determine whether each row passes or fails the validation criteria. 
The system expects binary validation responses, so avoid open-ended questions or prompts that might generate explanatory text instead of clear pass/fail judgments. Good prompt examples: - "Each row should contain a valid email address in the 'email' column and a non-empty name in the 'name' column" - "The 'sentiment' column should contain positive sentiment words (happy, good, excellent, etc.)" - "Product descriptions should mention at least one technical specification" Poor prompt examples (avoid these): - "What do you think about this data?" (too open-ended) - "Describe the quality of each row" (asks for description, not validation) - "How would you improve this data?" (asks for suggestions, not pass/fail) Performance Considerations -------------------------- AI validation is significantly slower than traditional validation methods due to API calls to LLM providers. However, performance varies dramatically based on data characteristics: **High Memoization Scenarios** (seconds to minutes): - data with many duplicate rows in the selected columns - low cardinality data (repeated patterns) - small number of unique row combinations **Low Memoization Scenarios** (minutes to hours): - high cardinality data with mostly unique rows - large datasets with few repeated patterns - all or most rows requiring individual LLM evaluation The row signature memoization optimization can reduce processing time significantly when data has repetitive patterns. For datasets where every row is unique, expect longer processing times similar to validating each row individually. **Strategies to Reduce Processing Time**: - test on data slices: define a sampling function like `def sample_1000(df): return df.head(1000)` and use `pre=sample_1000` to validate on smaller samples - filter relevant data: define filter functions like `def active_only(df): return df.filter(df["status"] == "active")` and use `pre=active_only` to focus on a specific subset - optimize column selection: use `columns_subset=` to include only the columns necessary for validation - start with smaller batches: begin with `batch_size=100` for testing, then increase gradually - reduce concurrency: lower `max_concurrent=1` if hitting rate limits - use faster/cheaper models: consider using smaller or more efficient models for initial testing before switching to more capable models Examples -------- The following examples demonstrate how to use AI validation for different types of data quality checks. These examples show both basic usage and more advanced configurations with custom thresholds and actions. **Basic AI validation example:** This first example shows a simple validation scenario where we want to check that customer records have both valid email addresses and non-empty names. Notice how we use `columns_subset=` to focus only on the relevant columns, which improves both performance and cost-effectiveness. 
```python import pointblank as pb import polars as pl # Sample data with email and name columns tbl = pl.DataFrame({ "email": ["john@example.com", "invalid-email", "jane@test.org"], "name": ["John Doe", "", "Jane Smith"], "age": [25, 30, 35] }) # Validate using AI validation = ( pb.Validate(data=tbl) .prompt( prompt="Each row should have a valid email address and a non-empty name", columns_subset=["email", "name"], # Only check these columns model="openai:gpt-4o-mini", ) .interrogate() ) validation ``` In this example, the AI will identify that the second row fails validation because it has both an invalid email format (`"invalid-email"`) and an empty name field. The validation results will show 1 out of 3 rows failing the criteria. **Advanced example with custom thresholds:** This more sophisticated example demonstrates how to use AI validation with custom thresholds and actions. Here we're validating phone number formats to ensure they include area codes, which is a common data quality requirement for customer contact information. ```python customer_data = pl.DataFrame({ "customer_id": [1, 2, 3, 4, 5], "name": ["John Doe", "Jane Smith", "Bob Johnson", "Alice Brown", "Charlie Davis"], "phone_number": [ "(555) 123-4567", # Valid with area code "555-987-6543", # Valid with area code "123-4567", # Missing area code "(800) 555-1234", # Valid with area code "987-6543" # Missing area code ] }) validation = ( pb.Validate(data=customer_data) .prompt( prompt="Do all the phone numbers include an area code?", columns_subset="phone_number", # Only check the `phone_number` column model="openai:gpt-4o", batch_size=500, max_concurrent=5, thresholds=pb.Thresholds(warning=0.1, error=0.2, critical=0.3), actions=pb.Actions(error="Too many phone numbers missing area codes.") ) .interrogate() ) ``` This validation will identify that 2 out of 5 phone numbers (40%) are missing area codes, which exceeds all threshold levels. The validation will trigger the specified error action since the failure rate (40%) is above the error threshold (20%). The AI can recognize various phone number formats and determine whether they include area codes. ## The Column Selection family A flexible way to select columns for validation is to use the `col()` function along with column selection helper functions. A combination of `col()` + `starts_with()`, `matches()`, etc., allows for the selection of multiple target columns (mapping a validation across many steps). Furthermore, the `col()` function can be used to declare a comparison column (e.g., for the `value=` argument in many `col_vals_*()` methods) when you can't use a fixed value for comparison. col(exprs: 'str | ColumnSelector | ColumnSelectorNarwhals') -> 'Column | ColumnLiteral | ColumnSelectorNarwhals' Helper function for referencing a column in the input table. Many of the validation methods (i.e., `col_vals_*()` methods) in Pointblank have a `value=` argument. These validations are comparisons between column values and a literal value, or, between column values and adjacent values in another column. The `col()` helper function is used to specify that it is a column being referenced, not a literal value. The `col()` function doesn't check that the column exists in the input table. It acts to signal that the value being compared is a column value. During validation (i.e., when [`interrogate()`](`pointblank.Validate.interrogate`) is called), Pointblank will then check that the column exists in the input table.
For creating expressions to use with the `conjointly()` validation method, use the [`expr_col()`](`pointblank.expr_col`) function instead. Parameters ---------- exprs Either the name of a single column in the target table, provided as a string, or, an expression involving column selector functions (e.g., `starts_with("a")`, `ends_with("e") | starts_with("a")`, etc.). Returns ------- Column | ColumnLiteral | ColumnSelectorNarwhals: A column object or expression representing the column reference. Usage with the `columns=` Argument ----------------------------------- The `col()` function can be used in the `columns=` argument of the following validation methods: - [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) - [`col_vals_lt()`](`pointblank.Validate.col_vals_lt`) - [`col_vals_ge()`](`pointblank.Validate.col_vals_ge`) - [`col_vals_le()`](`pointblank.Validate.col_vals_le`) - [`col_vals_eq()`](`pointblank.Validate.col_vals_eq`) - [`col_vals_ne()`](`pointblank.Validate.col_vals_ne`) - [`col_vals_between()`](`pointblank.Validate.col_vals_between`) - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`) - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`) - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`) - [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`) - [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`) - [`col_vals_null()`](`pointblank.Validate.col_vals_null`) - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`) - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`) - [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`) - [`col_exists()`](`pointblank.Validate.col_exists`) If specifying a single column with certainty (you have the exact name), `col()` is not necessary since you can just pass the column name as a string (though it is still valid to use `col("column_name")`, if preferred). However, if you want to select columns based on complex logic involving multiple column selector functions (e.g., columns that start with `"a"` but don't end with `"e"`), you need to use `col()` to wrap expressions involving column selector functions and logical operators such as `&`, `|`, `-`, and `~`. Here is an example of such usage with the [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) validation method: ```python col_vals_gt(columns=col(starts_with("a") & ~ends_with("e")), value=10) ``` If using only a single column selector function, you can pass the function directly to the `columns=` argument of the validation method, or, you can use `col()` to wrap the function (either is valid though the first is more concise). 
Here is an example of that simpler usage: ```python col_vals_gt(columns=starts_with("a"), value=10) ``` Usage with the `value=`, `left=`, and `right=` Arguments -------------------------------------------------------- The `col()` function can be used in the `value=` argument of the following validation methods - [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) - [`col_vals_lt()`](`pointblank.Validate.col_vals_lt`) - [`col_vals_ge()`](`pointblank.Validate.col_vals_ge`) - [`col_vals_le()`](`pointblank.Validate.col_vals_le`) - [`col_vals_eq()`](`pointblank.Validate.col_vals_eq`) - [`col_vals_ne()`](`pointblank.Validate.col_vals_ne`) and in the `left=` and `right=` arguments (either or both) of these two validation methods - [`col_vals_between()`](`pointblank.Validate.col_vals_between`) - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`) You cannot use column selector functions such as [`starts_with()`](`pointblank.starts_with`) in either of the `value=`, `left=`, or `right=` arguments since there would be no guarantee that a single column will be resolved from the target table with this approach. The `col()` function is used to signal that the value being compared is a column value and not a literal value. Available Selectors ------------------- There is a collection of selectors available in pointblank, allowing you to select columns based on attributes of column names and positions. The selectors are: - [`starts_with()`](`pointblank.starts_with`) - [`ends_with()`](`pointblank.ends_with`) - [`contains()`](`pointblank.contains`) - [`matches()`](`pointblank.matches`) - [`everything()`](`pointblank.everything`) - [`first_n()`](`pointblank.first_n`) - [`last_n()`](`pointblank.last_n`) Alternatively, we support selectors from the Narwhals library! Those selectors can additionally take advantage of the data types of the columns. The selectors are: - `boolean()` - `by_dtype()` - `categorical()` - `matches()` - `numeric()` - `string()` Have a look at the [Narwhals API documentation on selectors](https://narwhals-dev.github.io/narwhals/api-reference/selectors/) for more information. Examples -------- Suppose we have a table with columns `a` and `b` and we'd like to validate that the values in column `a` are greater than the values in column `b`. We can use the `col()` helper function to reference the comparison column when creating the validation step. ```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [5, 6, 5, 7, 6, 5], "b": [4, 2, 3, 3, 4, 3], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns="a", value=pb.col("b")) .interrogate() ) validation ``` From results of the validation table it can be seen that values in `a` were greater than values in `b` for every row (or test unit). Using `value=pb.col("b")` specified that the greater-than comparison is across columns, not with a fixed literal value. If you want to select an arbitrary set of columns upon which to base a validation, you can use column selector functions (e.g., [`starts_with()`](`pointblank.starts_with`), [`ends_with()`](`pointblank.ends_with`), etc.) to specify columns in the `columns=` argument of a validation method. Let's use the [`starts_with()`](`pointblank.starts_with`) column selector function to select columns that start with `"paid"` and validate that the values in those columns are greater than `10`. 
```python tbl = pl.DataFrame( { "name": ["Alice", "Bob", "Charlie"], "paid_2021": [16.32, 16.25, 15.75], "paid_2022": [18.62, 16.95, 18.25], "person_id": ["A123", "B456", "C789"], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns=pb.col(pb.starts_with("paid")), value=10) .interrogate() ) validation ``` In the above example the `col()` function contains the invocation of the [`starts_with()`](`pointblank.starts_with`) column selector function. This is not strictly necessary when using a single column selector function, so `columns=pb.starts_with("paid")` would be equivalent usage here. However, the use of `col()` is required when using multiple column selector functions with logical operators. Here is an example of that more complex usage: ```python tbl = pl.DataFrame( { "name": ["Alice", "Bob", "Charlie"], "hours_2022": [160, 180, 160], "hours_2023": [182, 168, 175], "hours_2024": [200, 165, 190], "paid_2022": [18.62, 16.95, 18.25], "paid_2023": [19.29, 17.75, 18.35], "paid_2024": [20.73, 18.35, 20.10], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt( columns=pb.col(pb.starts_with("paid") & pb.matches("2023|2024")), value=10 ) .interrogate() ) validation ``` In the above example the `col()` function contains the invocation of the [`starts_with()`](`pointblank.starts_with`) and [`matches()`](`pointblank.matches`) column selector functions, combined with the `&` operator. This is necessary to specify the set of columns that start with `"paid"` *and* match the text `"2023"` or `"2024"`. If you'd like to take advantage of Narwhals selectors, that's also possible. Here is an example of using the `numeric()` column selector function to select all numeric columns for validation, checking that their values are greater than `0`. ```python import narwhals.selectors as ncs tbl = pl.DataFrame( { "name": ["Alice", "Bob", "Charlie"], "hours_2022": [160, 180, 160], "hours_2023": [182, 168, 175], "hours_2024": [200, 165, 190], "paid_2022": [18.62, 16.95, 18.25], "paid_2023": [19.29, 17.75, 18.35], "paid_2024": [20.73, 18.35, 20.10], } ) validation = ( pb.Validate(data=tbl) .col_vals_ge(columns=pb.col(ncs.numeric()), value=0) .interrogate() ) validation ``` In the above example the `col()` function contains the invocation of the `numeric()` column selector function from Narwhals. As with the other selectors, this is not strictly necessary when using a single column selector, so `columns=ncs.numeric()` would also be fine here. Narwhals selectors can also use operators to combine multiple selectors. Here is an example of using the `numeric()` and [`matches()`](`pointblank.matches`) selectors together to select all numeric columns that fit a specific pattern. ```python tbl = pl.DataFrame( { "name": ["Alice", "Bob", "Charlie"], "2022_status": ["ft", "ft", "pt"], "2023_status": ["ft", "pt", "ft"], "2024_status": ["ft", "pt", "ft"], "2022_pay_total": [18.62, 16.95, 18.25], "2023_pay_total": [19.29, 17.75, 18.35], "2024_pay_total": [20.73, 18.35, 20.10], } ) validation = ( pb.Validate(data=tbl) .col_vals_lt(columns=pb.col(ncs.numeric() & ncs.matches("2023|2024")), value=30) .interrogate() ) validation ``` In the above example the `col()` function contains the invocation of the `numeric()` and [`matches()`](`pointblank.matches`) column selector functions from Narwhals, combined with the `&` operator. This is necessary to specify the set of columns that are numeric *and* match the text `"2023"` or `"2024"`. 
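As a complement to the examples above, `col()` works the same way in the `left=` and `right=` arguments of [`col_vals_between()`](`pointblank.Validate.col_vals_between`). Here is a minimal sketch of a row-wise range check where the bounds live in other columns (the column names `lo` and `hi` are illustrative and not taken from any earlier example):

```python
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [5, 6, 5, 7, 6, 5],
        "lo": [4, 4, 3, 5, 4, 3],
        "hi": [8, 8, 7, 9, 8, 7],
    }
)

validation = (
    pb.Validate(data=tbl)
    # Each value in `a` is compared against that row's own `lo` and `hi` values
    .col_vals_between(columns="a", left=pb.col("lo"), right=pb.col("hi"))
    .interrogate()
)

validation
```

Every row should pass here since each value in `a` falls within that row's `lo` to `hi` range. Because `col()` can be used in either or both of `left=` and `right=`, one side could instead be a literal value (e.g., `left=0`) while the other references a column.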
See Also -------- Create a column expression for use in `conjointly()` validation with the [`expr_col()`](`pointblank.expr_col`) function. starts_with(text: 'str', case_sensitive: 'bool' = False) -> 'StartsWith' Select columns that start with specified text. Many validation methods have a `columns=` argument that can be used to specify the columns for validation (e.g., [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`), [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`), etc.). The `starts_with()` selector function can be used to select one or more columns that start with some specified text. So if the set of table columns consists of `[name_first, name_last, age, address]` and you want to validate columns that start with `"name"`, you can use `columns=starts_with("name")`. This will select the `name_first` and `name_last` columns. There will be a validation step created for every resolved column. Note that if there aren't any columns resolved from using `starts_with()` (or any other expression using selector functions), the validation step will fail to be evaluated during the interrogation process. Such a failure to evaluate will be reported in the validation results but it won't affect the interrogation process overall (i.e., the process won't be halted). Parameters ---------- text The text that the column name should start with. case_sensitive Whether column names should be treated as case-sensitive. The default is `False`. Returns ------- StartsWith A `StartsWith` object, which can be used to select columns that start with the specified text. Relevant Validation Methods where `starts_with()` can be Used ------------------------------------------------------------- This selector function can be used in the `columns=` argument of the following validation methods: - [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) - [`col_vals_lt()`](`pointblank.Validate.col_vals_lt`) - [`col_vals_ge()`](`pointblank.Validate.col_vals_ge`) - [`col_vals_le()`](`pointblank.Validate.col_vals_le`) - [`col_vals_eq()`](`pointblank.Validate.col_vals_eq`) - [`col_vals_ne()`](`pointblank.Validate.col_vals_ne`) - [`col_vals_between()`](`pointblank.Validate.col_vals_between`) - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`) - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`) - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`) - [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`) - [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`) - [`col_vals_null()`](`pointblank.Validate.col_vals_null`) - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`) - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`) - [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`) - [`col_exists()`](`pointblank.Validate.col_exists`) The `starts_with()` selector function doesn't need to be used in isolation. Read the next section for information on how to compose it with other column selectors for more refined ways to select columns. Additional Flexibilty through Composition with Other Column Selectors --------------------------------------------------------------------- The `starts_with()` function can be composed with other column selectors to create fine-grained column selections. For example, to select columns that start with `"a"` and end with `"e"`, you can use the `starts_with()` and [`ends_with()`](`pointblank.ends_with`) functions together. 
The only condition is that the expressions are wrapped in the [`col()`](`pointblank.col`) function, like this: ```python col(starts_with("a") & ends_with("e")) ``` There are four operators that can be used to compose column selectors: - `&` (*and*) - `|` (*or*) - `-` (*difference*) - `~` (*not*) The `&` operator is used to select columns that satisfy both conditions. The `|` operator is used to select columns that satisfy either condition. The `-` operator is used to select columns that satisfy the first condition but not the second. The `~` operator is used to select columns that don't satisfy the condition. As many selector functions can be used as needed and the operators can be combined to create complex column selection criteria (parentheses can be used to group conditions and control the order of evaluation). Examples -------- Suppose we have a table with columns `name`, `paid_2021`, `paid_2022`, and `person_id` and we'd like to validate that the values in columns that start with `"paid"` are greater than `10`. We can use the `starts_with()` column selector function to specify the columns that start with `"paid"` as the columns to validate. ```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "name": ["Alice", "Bob", "Charlie"], "paid_2021": [16.32, 16.25, 15.75], "paid_2022": [18.62, 16.95, 18.25], "person_id": ["A123", "B456", "C789"], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns=pb.starts_with("paid"), value=10) .interrogate() ) validation ``` From the results of the validation table we get two validation steps, one for `paid_2021` and one for `paid_2022`. The values in both columns were all greater than `10`. We can also use the `starts_with()` function in combination with other column selectors (within [`col()`](`pointblank.col`)) to create more complex column selection criteria (i.e., to select columns that satisfy multiple conditions). For example, to select columns that start with `"paid"` and match the text `"2023"` or `"2024"`, we can use the `&` operator to combine column selectors. ```python tbl = pl.DataFrame( { "name": ["Alice", "Bob", "Charlie"], "hours_2022": [160, 180, 160], "hours_2023": [182, 168, 175], "hours_2024": [200, 165, 190], "paid_2022": [18.62, 16.95, 18.25], "paid_2023": [19.29, 17.75, 18.35], "paid_2024": [20.73, 18.35, 20.10], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt( columns=pb.col(pb.starts_with("paid") & pb.matches("23|24")), value=10 ) .interrogate() ) validation ``` From the results of the validation table we get two validation steps, one for `paid_2023` and one for `paid_2024`. ends_with(text: 'str', case_sensitive: 'bool' = False) -> 'EndsWith' Select columns that end with specified text. Many validation methods have a `columns=` argument that can be used to specify the columns for validation (e.g., [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`), [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`), etc.). The `ends_with()` selector function can be used to select one or more columns that end with some specified text. So if the set of table columns consists of `[first_name, last_name, age, address]` and you want to validate columns that end with `"name"`, you can use `columns=ends_with("name")`. This will select the `first_name` and `last_name` columns. There will be a validation step created for every resolved column. 
Note that if there aren't any columns resolved from using `ends_with()` (or any other expression using selector functions), the validation step will fail to be evaluated during the interrogation process. Such a failure to evaluate will be reported in the validation results but it won't affect the interrogation process overall (i.e., the process won't be halted). Parameters ---------- text The text that the column name should end with. case_sensitive Whether column names should be treated as case-sensitive. The default is `False`. Returns ------- EndsWith An `EndsWith` object, which can be used to select columns that end with the specified text. Relevant Validation Methods where `ends_with()` can be Used ----------------------------------------------------------- This selector function can be used in the `columns=` argument of the following validation methods: - [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) - [`col_vals_lt()`](`pointblank.Validate.col_vals_lt`) - [`col_vals_ge()`](`pointblank.Validate.col_vals_ge`) - [`col_vals_le()`](`pointblank.Validate.col_vals_le`) - [`col_vals_eq()`](`pointblank.Validate.col_vals_eq`) - [`col_vals_ne()`](`pointblank.Validate.col_vals_ne`) - [`col_vals_between()`](`pointblank.Validate.col_vals_between`) - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`) - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`) - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`) - [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`) - [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`) - [`col_vals_null()`](`pointblank.Validate.col_vals_null`) - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`) - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`) - [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`) - [`col_exists()`](`pointblank.Validate.col_exists`) The `ends_with()` selector function doesn't need to be used in isolation. Read the next section for information on how to compose it with other column selectors for more refined ways to select columns. Additional Flexibilty through Composition with Other Column Selectors --------------------------------------------------------------------- The `ends_with()` function can be composed with other column selectors to create fine-grained column selections. For example, to select columns that end with `"e"` and start with `"a"`, you can use the `ends_with()` and [`starts_with()`](`pointblank.starts_with`) functions together. The only condition is that the expressions are wrapped in the [`col()`](`pointblank.col`) function, like this: ```python col(ends_with("e") & starts_with("a")) ``` There are four operators that can be used to compose column selectors: - `&` (*and*) - `|` (*or*) - `-` (*difference*) - `~` (*not*) The `&` operator is used to select columns that satisfy both conditions. The `|` operator is used to select columns that satisfy either condition. The `-` operator is used to select columns that satisfy the first condition but not the second. The `~` operator is used to select columns that don't satisfy the condition. As many selector functions can be used as needed and the operators can be combined to create complex column selection criteria (parentheses can be used to group conditions and control the order of evaluation). 
Examples -------- Suppose we have a table with columns `name`, `2021_pay`, `2022_pay`, and `person_id` and we'd like to validate that the values in columns that end with `"pay"` are greater than `10`. We can use the `ends_with()` column selector function to specify the columns that end with `"pay"` as the columns to validate. ```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "name": ["Alice", "Bob", "Charlie"], "2021_pay": [16.32, 16.25, 15.75], "2022_pay": [18.62, 16.95, 18.25], "person_id": ["A123", "B456", "C789"], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns=pb.ends_with("pay"), value=10) .interrogate() ) validation ``` From the results of the validation table we get two validation steps, one for `2021_pay` and one for `2022_pay`. The values in both columns were all greater than `10`. We can also use the `ends_with()` function in combination with other column selectors (within [`col()`](`pointblank.col`)) to create more complex column selection criteria (i.e., to select columns that satisfy multiple conditions). For example, to select columns that end with `"pay"` and match the text `"2023"` or `"2024"`, we can use the `&` operator to combine column selectors. ```python tbl = pl.DataFrame( { "name": ["Alice", "Bob", "Charlie"], "2022_hours": [160, 180, 160], "2023_hours": [182, 168, 175], "2024_hours": [200, 165, 190], "2022_pay": [18.62, 16.95, 18.25], "2023_pay": [19.29, 17.75, 18.35], "2024_pay": [20.73, 18.35, 20.10], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt( columns=pb.col(pb.ends_with("pay") & pb.matches("2023|2024")), value=10 ) .interrogate() ) validation ``` From the results of the validation table we get two validation steps, one for `2023_pay` and one for `2024_pay`. contains(text: 'str', case_sensitive: 'bool' = False) -> 'Contains' Select columns that contain specified text. Many validation methods have a `columns=` argument that can be used to specify the columns for validation (e.g., [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`), [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`), etc.). The `contains()` selector function can be used to select one or more columns that contain some specified text. So if the set of table columns consists of `[profit, conv_first, conv_last, highest_conv, age]` and you want to validate columns that have `"conv"` in the name, you can use `columns=contains("conv")`. This will select the `conv_first`, `conv_last`, and `highest_conv` columns. There will be a validation step created for every resolved column. Note that if there aren't any columns resolved from using `contains()` (or any other expression using selector functions), the validation step will fail to be evaluated during the interrogation process. Such a failure to evaluate will be reported in the validation results but it won't affect the interrogation process overall (i.e., the process won't be halted). Parameters ---------- text The text that the column name should contain. case_sensitive Whether column names should be treated as case-sensitive. The default is `False`. Returns ------- Contains A `Contains` object, which can be used to select columns that contain the specified text. 
Relevant Validation Methods where `contains()` can be Used ---------------------------------------------------------- This selector function can be used in the `columns=` argument of the following validation methods: - [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) - [`col_vals_lt()`](`pointblank.Validate.col_vals_lt`) - [`col_vals_ge()`](`pointblank.Validate.col_vals_ge`) - [`col_vals_le()`](`pointblank.Validate.col_vals_le`) - [`col_vals_eq()`](`pointblank.Validate.col_vals_eq`) - [`col_vals_ne()`](`pointblank.Validate.col_vals_ne`) - [`col_vals_between()`](`pointblank.Validate.col_vals_between`) - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`) - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`) - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`) - [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`) - [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`) - [`col_vals_null()`](`pointblank.Validate.col_vals_null`) - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`) - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`) - [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`) - [`col_exists()`](`pointblank.Validate.col_exists`) The `contains()` selector function doesn't need to be used in isolation. Read the next section for information on how to compose it with other column selectors for more refined ways to select columns. Additional Flexibilty through Composition with Other Column Selectors --------------------------------------------------------------------- The `contains()` function can be composed with other column selectors to create fine-grained column selections. For example, to select columns that have the text `"_n"` and start with `"item"`, you can use the `contains()` and [`starts_with()`](`pointblank.starts_with`) functions together. The only condition is that the expressions are wrapped in the [`col()`](`pointblank.col`) function, like this: ```python col(contains("_n") & starts_with("item")) ``` There are four operators that can be used to compose column selectors: - `&` (*and*) - `|` (*or*) - `-` (*difference*) - `~` (*not*) The `&` operator is used to select columns that satisfy both conditions. The `|` operator is used to select columns that satisfy either condition. The `-` operator is used to select columns that satisfy the first condition but not the second. The `~` operator is used to select columns that don't satisfy the condition. As many selector functions can be used as needed and the operators can be combined to create complex column selection criteria (parentheses can be used to group conditions and control the order of evaluation). Examples -------- Suppose we have a table with columns `name`, `2021_pay_total`, `2022_pay_total`, and `person_id` and we'd like to validate that the values in columns having `"pay"` in the name are greater than `10`. We can use the `contains()` column selector function to specify the column names that contain `"pay"` as the columns to validate. 
```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "name": ["Alice", "Bob", "Charlie"], "2021_pay_total": [16.32, 16.25, 15.75], "2022_pay_total": [18.62, 16.95, 18.25], "person_id": ["A123", "B456", "C789"], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns=pb.contains("pay"), value=10) .interrogate() ) validation ``` From the results of the validation table we get two validation steps, one for `2021_pay_total` and one for `2022_pay_total`. The values in both columns were all greater than `10`. We can also use the `contains()` function in combination with other column selectors (within [`col()`](`pointblank.col`)) to create more complex column selection criteria (i.e., to select columns that satisfy multiple conditions). For example, to select columns that contain `"pay"` and match the text `"2023"` or `"2024"`, we can use the `&` operator to combine column selectors. ```python tbl = pl.DataFrame( { "name": ["Alice", "Bob", "Charlie"], "2022_hours": [160, 180, 160], "2023_hours": [182, 168, 175], "2024_hours": [200, 165, 190], "2022_pay_total": [18.62, 16.95, 18.25], "2023_pay_total": [19.29, 17.75, 18.35], "2024_pay_total": [20.73, 18.35, 20.10], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt( columns=pb.col(pb.contains("pay") & pb.matches("2023|2024")), value=10 ) .interrogate() ) validation ``` From the results of the validation table we get two validation steps, one for `2023_pay_total` and one for `2024_pay_total`. matches(pattern: 'str', case_sensitive: 'bool' = False) -> 'Matches' Select columns that match a specified regular expression pattern. Many validation methods have a `columns=` argument that can be used to specify the columns for validation (e.g., [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`), [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`), etc.). The `matches()` selector function can be used to select one or more columns matching a provided regular expression pattern. So if the set of table columns consists of `[rev_01, rev_02, profit_01, profit_02, age]` and you want to validate columns that have two digits at the end of the name, you can use `columns=matches(r"[0-9]{2}$")`. This will select the `rev_01`, `rev_02`, `profit_01`, and `profit_02` columns. There will be a validation step created for every resolved column. Note that if there aren't any columns resolved from using `matches()` (or any other expression using selector functions), the validation step will fail to be evaluated during the interrogation process. Such a failure to evaluate will be reported in the validation results but it won't affect the interrogation process overall (i.e., the process won't be halted). Parameters ---------- pattern The regular expression pattern that the column name should match. case_sensitive Whether column names should be treated as case-sensitive. The default is `False`. Returns ------- Matches A `Matches` object, which can be used to select columns that match the specified pattern. 
Relevant Validation Methods where `matches()` can be Used --------------------------------------------------------- This selector function can be used in the `columns=` argument of the following validation methods: - [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) - [`col_vals_lt()`](`pointblank.Validate.col_vals_lt`) - [`col_vals_ge()`](`pointblank.Validate.col_vals_ge`) - [`col_vals_le()`](`pointblank.Validate.col_vals_le`) - [`col_vals_eq()`](`pointblank.Validate.col_vals_eq`) - [`col_vals_ne()`](`pointblank.Validate.col_vals_ne`) - [`col_vals_between()`](`pointblank.Validate.col_vals_between`) - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`) - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`) - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`) - [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`) - [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`) - [`col_vals_null()`](`pointblank.Validate.col_vals_null`) - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`) - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`) - [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`) - [`col_exists()`](`pointblank.Validate.col_exists`) The `matches()` selector function doesn't need to be used in isolation. Read the next section for information on how to compose it with other column selectors for more refined ways to select columns. Additional Flexibilty through Composition with Other Column Selectors --------------------------------------------------------------------- The `matches()` function can be composed with other column selectors to create fine-grained column selections. For example, to select columns that have the text starting with five digits and end with `"_id"`, you can use the `matches()` and [`ends_with()`](`pointblank.ends_with`) functions together. The only condition is that the expressions are wrapped in the [`col()`](`pointblank.col`) function, like this: ```python col(matches(r"^[0-9]{5}") & ends_with("_id")) ``` There are four operators that can be used to compose column selectors: - `&` (*and*) - `|` (*or*) - `-` (*difference*) - `~` (*not*) The `&` operator is used to select columns that satisfy both conditions. The `|` operator is used to select columns that satisfy either condition. The `-` operator is used to select columns that satisfy the first condition but not the second. The `~` operator is used to select columns that don't satisfy the condition. As many selector functions can be used as needed and the operators can be combined to create complex column selection criteria (parentheses can be used to group conditions and control the order of evaluation). Examples -------- Suppose we have a table with columns `name`, `id_old`, `new_identifier`, and `pay_2021` and we'd like to validate that text values in columns having `"id"` or `"identifier"` in the name have a specific syntax. We can use the `matches()` column selector function to specify the columns that match the pattern. 
```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "name": ["Alice", "Bob", "Charlie"], "id_old": ["ID0021", "ID0032", "ID0043"], "new_identifier": ["ID9054", "ID9065", "ID9076"], "pay_2021": [16.32, 16.25, 15.75], } ) validation = ( pb.Validate(data=tbl) .col_vals_regex(columns=pb.matches("id|identifier"), pattern=r"ID[0-9]{4}") .interrogate() ) validation ``` From the results of the validation table we get two validation steps, one for `id_old` and one for `new_identifier`. The values in both columns all match the pattern `"ID[0-9]{4}"`. We can also use the `matches()` function in combination with other column selectors (within [`col()`](`pointblank.col`)) to create more complex column selection criteria (i.e., to select columns that satisfy multiple conditions). For example, to select columns that contain `"pay"` and match the text `"2023"` or `"2024"`, we can use the `&` operator to combine column selectors. ```python tbl = pl.DataFrame( { "name": ["Alice", "Bob", "Charlie"], "2022_hours": [160, 180, 160], "2023_hours": [182, 168, 175], "2024_hours": [200, 165, 190], "2022_pay_total": [18.62, 16.95, 18.25], "2023_pay_total": [19.29, 17.75, 18.35], "2024_pay_total": [20.73, 18.35, 20.10], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt( columns=pb.col(pb.contains("pay") & pb.matches("2023|2024")), value=10 ) .interrogate() ) validation ``` From the results of the validation table we get two validation steps, one for `2023_pay_total` and one for `2024_pay_total`. everything() -> 'Everything' Select all columns. Many validation methods have a `columns=` argument that can be used to specify the columns for validation (e.g., [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`), [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`), etc.). The `everything()` selector function can be used to select every column in the table. If you have a table with six columns and they're all suitable for a specific type of validation, you can use `columns=everything()` and all six columns will be selected for validation. Returns ------- Everything An `Everything` object, which can be used to select all columns. Relevant Validation Methods where `everything()` can be Used ------------------------------------------------------------ This selector function can be used in the `columns=` argument of the following validation methods: - [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) - [`col_vals_lt()`](`pointblank.Validate.col_vals_lt`) - [`col_vals_ge()`](`pointblank.Validate.col_vals_ge`) - [`col_vals_le()`](`pointblank.Validate.col_vals_le`) - [`col_vals_eq()`](`pointblank.Validate.col_vals_eq`) - [`col_vals_ne()`](`pointblank.Validate.col_vals_ne`) - [`col_vals_between()`](`pointblank.Validate.col_vals_between`) - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`) - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`) - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`) - [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`) - [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`) - [`col_vals_null()`](`pointblank.Validate.col_vals_null`) - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`) - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`) - [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`) - [`col_exists()`](`pointblank.Validate.col_exists`) The `everything()` selector function doesn't need to be used in isolation.
Read the next section for information on how to compose it with other column selectors for more refined ways to select columns. Additional Flexibility through Composition with Other Column Selectors --------------------------------------------------------------------- The `everything()` function can be composed with other column selectors to create fine-grained column selections. For example, to select all column names except those starting with `"id_"`, you can use the `everything()` and [`starts_with()`](`pointblank.starts_with`) functions together. The only condition is that the expressions are wrapped in the [`col()`](`pointblank.col`) function, like this: ```python col(everything() - starts_with("id_")) ``` There are four operators that can be used to compose column selectors: - `&` (*and*) - `|` (*or*) - `-` (*difference*) - `~` (*not*) The `&` operator is used to select columns that satisfy both conditions. The `|` operator is used to select columns that satisfy either condition. The `-` operator is used to select columns that satisfy the first condition but not the second. The `~` operator is used to select columns that don't satisfy the condition. As many selector functions can be used as needed and the operators can be combined to create complex column selection criteria (parentheses can be used to group conditions and control the order of evaluation). Examples -------- Suppose we have a table with several numeric columns and we'd like to validate that all of these columns have values less than `1000`. We can use the `everything()` column selector function to select all columns for validation. ```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "2023_hours": [182, 168, 175], "2024_hours": [200, 165, 190], "2023_pay_total": [19.29, 17.75, 18.35], "2024_pay_total": [20.73, 18.35, 20.10], } ) validation = ( pb.Validate(data=tbl) .col_vals_lt(columns=pb.everything(), value=1000) .interrogate() ) validation ``` From the results of the validation table we get four validation steps, one for each column in the table. The values in every column were all lower than `1000`. We can also use the `everything()` function in combination with other column selectors (within [`col()`](`pointblank.col`)) to create more complex column selection criteria (i.e., to select columns that satisfy multiple conditions). For example, to select every column except those that begin with `"2023"`, we can use the `-` operator to combine column selectors. ```python tbl = pl.DataFrame( { "2023_hours": [182, 168, 175], "2024_hours": [200, 165, 190], "2023_pay_total": [19.29, 17.75, 18.35], "2024_pay_total": [20.73, 18.35, 20.10], } ) validation = ( pb.Validate(data=tbl) .col_vals_lt(columns=pb.col(pb.everything() - pb.starts_with("2023")), value=1000) .interrogate() ) validation ``` From the results of the validation table we get two validation steps, one for `2024_hours` and one for `2024_pay_total`. first_n(n: 'int', offset: 'int' = 0) -> 'FirstN' Select the first `n` columns in the column list. Many validation methods have a `columns=` argument that can be used to specify the columns for validation (e.g., [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`), [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`), etc.). The `first_n()` selector function can be used to select *n* columns positioned at the start of the column list.
So if the set of table columns consists of `[rev_01, rev_02, profit_01, profit_02, age]` and you want to validate the first two columns, you can use `columns=first_n(2)`. This will select the `rev_01` and `rev_02` columns and a validation step will be created for each. The `offset=` parameter can be used to skip a certain number of columns from the start of the column list. So if you want to select the third and fourth columns, you can use `columns=first_n(2, offset=2)`. Parameters ---------- n The number of columns to select from the start of the column list. Should be a positive integer value. If `n` is greater than the number of columns in the table, all columns will be selected. offset The offset from the start of the column list. The default is `0`. If `offset` is greater than the number of columns in the table, no columns will be selected. Returns ------- FirstN A `FirstN` object, which can be used to select the first `n` columns. Relevant Validation Methods where `first_n()` can be Used --------------------------------------------------------- This selector function can be used in the `columns=` argument of the following validation methods: - [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) - [`col_vals_lt()`](`pointblank.Validate.col_vals_lt`) - [`col_vals_ge()`](`pointblank.Validate.col_vals_ge`) - [`col_vals_le()`](`pointblank.Validate.col_vals_le`) - [`col_vals_eq()`](`pointblank.Validate.col_vals_eq`) - [`col_vals_ne()`](`pointblank.Validate.col_vals_ne`) - [`col_vals_between()`](`pointblank.Validate.col_vals_between`) - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`) - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`) - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`) - [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`) - [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`) - [`col_vals_null()`](`pointblank.Validate.col_vals_null`) - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`) - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`) - [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`) - [`col_exists()`](`pointblank.Validate.col_exists`) The `first_n()` selector function doesn't need to be used in isolation. Read the next section for information on how to compose it with other column selectors for more refined ways to select columns. Additional Flexibilty through Composition with Other Column Selectors --------------------------------------------------------------------- The `first_n()` function can be composed with other column selectors to create fine-grained column selections. For example, to select all column names starting with "rev" along with the first two columns, you can use the `first_n()` and [`starts_with()`](`pointblank.starts_with`) functions together. The only condition is that the expressions are wrapped in the [`col()`](`pointblank.col`) function, like this: ```python col(first_n(2) | starts_with("rev")) ``` There are four operators that can be used to compose column selectors: - `&` (*and*) - `|` (*or*) - `-` (*difference*) - `~` (*not*) The `&` operator is used to select columns that satisfy both conditions. The `|` operator is used to select columns that satisfy either condition. The `-` operator is used to select columns that satisfy the first condition but not the second. The `~` operator is used to select columns that don't satisfy the condition. 
As many selector functions can be used as needed and the operators can be combined to create complex column selection criteria (parentheses can be used to group conditions and control the order of evaluation). Examples -------- Suppose we have a table with columns `paid_2021`, `paid_2022`, `paid_2023`, `paid_2024`, and `name` and we'd like to validate that the values in the first four columns are greater than `10`. We can use the `first_n()` column selector function to specify that the first four columns in the table are the columns to validate. ```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "paid_2021": [17.94, 16.55, 17.85], "paid_2022": [18.62, 16.95, 18.25], "paid_2023": [19.29, 17.75, 18.35], "paid_2024": [20.73, 18.35, 20.10], "name": ["Alice", "Bob", "Charlie"], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns=pb.first_n(4), value=10) .interrogate() ) validation ``` From the results of the validation table we get four validation steps. The values in all those columns were all greater than `10`. We can also use the `first_n()` function in combination with other column selectors (within [`col()`](`pointblank.col`)) to create more complex column selection criteria (i.e., to select columns that satisfy multiple conditions). For example, to select the first four columns but also omit those columns that end with `"2023"`, we can use the `-` operator to combine column selectors. ```python tbl = pl.DataFrame( { "paid_2021": [17.94, 16.55, 17.85], "paid_2022": [18.62, 16.95, 18.25], "paid_2023": [19.29, 17.75, 18.35], "paid_2024": [20.73, 18.35, 20.10], "name": ["Alice", "Bob", "Charlie"], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns=pb.col(pb.first_n(4) - pb.ends_with("2023")), value=10) .interrogate() ) validation ``` From the results of the validation table we get three validation steps, one for `paid_2021`, `paid_2022`, and `paid_2024`. last_n(n: 'int', offset: 'int' = 0) -> 'LastN' Select the last `n` columns in the column list. Many validation methods have a `columns=` argument that can be used to specify the columns for validation (e.g., [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`), [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`), etc.). The `last_n()` selector function can be used to select *n* columns positioned at the end of the column list. So if the set of table columns consists of `[age, rev_01, rev_02, profit_01, profit_02]` and you want to validate the last two columns, you can use `columns=last_n(2)`. This will select the `profit_01` and `profit_02` columns and a validation step will be created for each. The `offset=` parameter can be used to skip a certain number of columns from the end of the column list. So if you want to select the third and fourth columns from the end, you can use `columns=last_n(2, offset=2)`. Parameters ---------- n The number of columns to select from the end of the column list. Should be a positive integer value. If `n` is greater than the number of columns in the table, all columns will be selected. offset The offset from the end of the column list. The default is `0`. If `offset` is greater than the number of columns in the table, no columns will be selected. Returns ------- LastN A `LastN` object, which can be used to select the last `n` columns. 
Relevant Validation Methods where `last_n()` can be Used -------------------------------------------------------- This selector function can be used in the `columns=` argument of the following validation methods: - [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) - [`col_vals_lt()`](`pointblank.Validate.col_vals_lt`) - [`col_vals_ge()`](`pointblank.Validate.col_vals_ge`) - [`col_vals_le()`](`pointblank.Validate.col_vals_le`) - [`col_vals_eq()`](`pointblank.Validate.col_vals_eq`) - [`col_vals_ne()`](`pointblank.Validate.col_vals_ne`) - [`col_vals_between()`](`pointblank.Validate.col_vals_between`) - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`) - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`) - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`) - [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`) - [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`) - [`col_vals_null()`](`pointblank.Validate.col_vals_null`) - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`) - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`) - [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`) - [`col_exists()`](`pointblank.Validate.col_exists`) The `last_n()` selector function doesn't need to be used in isolation. Read the next section for information on how to compose it with other column selectors for more refined ways to select columns. Additional Flexibilty through Composition with Other Column Selectors --------------------------------------------------------------------- The `last_n()` function can be composed with other column selectors to create fine-grained column selections. For example, to select all column names starting with "rev" along with the last two columns, you can use the `last_n()` and [`starts_with()`](`pointblank.starts_with`) functions together. The only condition is that the expressions are wrapped in the [`col()`](`pointblank.col`) function, like this: ```python col(last_n(2) | starts_with("rev")) ``` There are four operators that can be used to compose column selectors: - `&` (*and*) - `|` (*or*) - `-` (*difference*) - `~` (*not*) The `&` operator is used to select columns that satisfy both conditions. The `|` operator is used to select columns that satisfy either condition. The `-` operator is used to select columns that satisfy the first condition but not the second. The `~` operator is used to select columns that don't satisfy the condition. As many selector functions can be used as needed and the operators can be combined to create complex column selection criteria (parentheses can be used to group conditions and control the order of evaluation). Examples -------- Suppose we have a table with columns `name`, `paid_2021`, `paid_2022`, `paid_2023`, and `paid_2024` and we'd like to validate that the values in the last four columns are greater than `10`. We can use the `last_n()` column selector function to specify that the last four columns in the table are the columns to validate. ```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "name": ["Alice", "Bob", "Charlie"], "paid_2021": [17.94, 16.55, 17.85], "paid_2022": [18.62, 16.95, 18.25], "paid_2023": [19.29, 17.75, 18.35], "paid_2024": [20.73, 18.35, 20.10], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns=pb.last_n(4), value=10) .interrogate() ) validation ``` From the results of the validation table we get four validation steps. 
The values in all those columns were all greater than `10`. We can also use the `last_n()` function in combination with other column selectors (within [`col()`](`pointblank.col`)) to create more complex column selection criteria (i.e., to select columns that satisfy multiple conditions). For example, to select the last four columns but also omit those columns that end with `"2023"`, we can use the `-` operator to combine column selectors. ```python tbl = pl.DataFrame( { "name": ["Alice", "Bob", "Charlie"], "paid_2021": [17.94, 16.55, 17.85], "paid_2022": [18.62, 16.95, 18.25], "paid_2023": [19.29, 17.75, 18.35], "paid_2024": [20.73, 18.35, 20.10], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns=pb.col(pb.last_n(4) - pb.ends_with("2023")), value=10) .interrogate() ) validation ``` From the results of the validation table we get three validation steps, one for `paid_2021`, `paid_2022`, and `paid_2024`. expr_col(column_name: 'str') -> 'ColumnExpression' Create a column expression for use in `conjointly()` validation. This function returns a ColumnExpression object that supports operations like `>`, `<`, `+`, etc. for use in [`conjointly()`](`pointblank.Validate.conjointly`) validation expressions. Parameters ---------- column_name The name of the column to reference. Returns ------- ColumnExpression A column expression that can be used in comparisons and operations. Examples -------- Let's say we have a table with three columns: `a`, `b`, and `c`. We want to validate that: - The values in column `a` are greater than `2`. - The values in column `b` are less than `7`. - The sum of columns `a` and `b` is less than the values in column `c`. We can use the `expr_col()` function to create a column expression for each of these conditions. ```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [5, 7, 1, 3, 9, 4], "b": [6, 3, 0, 5, 8, 2], "c": [10, 4, 8, 9, 10, 5], } ) # Using expr_col() to create backend-agnostic validation expressions validation = ( pb.Validate(data=tbl) .conjointly( lambda df: pb.expr_col("a") > 2, lambda df: pb.expr_col("b") < 7, lambda df: pb.expr_col("a") + pb.expr_col("b") < pb.expr_col("c") ) .interrogate() ) validation ``` The above code creates a validation object that checks the specified conditions using the `expr_col()` function. The resulting validation table will show whether each condition was satisfied for each row in the table. See Also -------- The [`conjointly()`](`pointblank.Validate.conjointly`) validation method, which is where this function should be used. ## The Segments family Combine multiple values into a single segment using `seg_*()` helper functions. seg_group(values: 'list[Any]') -> 'Segment' Group together values for segmentation. Many validation methods have a `segments=` argument that can be used to specify one or more columns, or certain values within a column, to create segments for validation (e.g., [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`), [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`), etc.). When passing in a column, or a tuple with a column and certain values, a segment will be created for each individual value within the column or given values. The `seg_group()` selector enables values to be grouped together into a segment. 
For example, if you were to create a segment for a column "region", investigating just "North" and "South" regions, a typical segment would look like: `segments=("region", ["North", "South"])` This would create two validation steps, one for each of the regions. If you wanted to group these two regions into a single segment, you could use the `seg_group()` function like this: `segments=("region", pb.seg_group(["North", "South"]))` You could create a second segment for "East" and "West" regions like this: `segments=("region", pb.seg_group([["North", "South"], ["East", "West"]]))` There will be a validation step created for every segment. Note that if there aren't any segments created using `seg_group()` (or any other segment expression), the validation step will fail to be evaluated during the interrogation process. Such a failure to evaluate will be reported in the validation results but it won't affect the interrogation process overall (i.e., the process won't be halted). Parameters ---------- values A list of values to be grouped into a segment. This can be a single list or a list of lists. Returns ------- Segment A `Segment` object, which can be used to combine values into a segment. Examples -------- Let's say we're analyzing sales from our local bookstore, and want to check that the number of books sold for the month exceeds a certain threshold. We could pass in the argument `segments="genre"`, which would return a segment for each unique genre in the dataset. We could also pass in `segments=("genre", ["Fantasy", "Science Fiction"])`, to only create segments for those two genres. However, if we wanted to group these two genres into a single segment, we could use the `seg_group()` function. ```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "title": [ "The Hobbit", "Harry Potter and the Sorcerer's Stone", "The Lord of the Rings", "A Game of Thrones", "The Name of the Wind", "The Girl with the Dragon Tattoo", "The Da Vinci Code", "The Hitchhiker's Guide to the Galaxy", "The Martian", "Brave New World" ], "genre": [ "Fantasy", "Fantasy", "Fantasy", "Fantasy", "Fantasy", "Mystery", "Mystery", "Science Fiction", "Science Fiction", "Science Fiction", ], "units_sold": [875, 932, 756, 623, 445, 389, 678, 534, 712, 598], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt( columns="units_sold", value=500, segments=("genre", pb.seg_group(["Fantasy", "Science Fiction"])) ) .interrogate() ) validation ``` What's more, we can create multiple segments, combining the genres in different ways. ```python validation = ( pb.Validate(data=tbl) .col_vals_gt( columns="units_sold", value=500, segments=("genre", pb.seg_group([ ["Fantasy", "Science Fiction"], ["Fantasy", "Mystery"], ["Mystery", "Science Fiction"] ])) ) .interrogate() ) validation ``` ## The Interrogation and Reporting family The validation plan is put into action when `interrogate()` is called. The workflow for performing a comprehensive validation is then: (1) `Validate()`, (2) adding validation steps, (3) `interrogate()`. After interrogation of the data, we can view a validation report table (by printing the object or using `get_tabular_report()`), extract key metrics, or we can split the data based on the validation results (with `get_sundered_data()`).
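A minimal sketch of that workflow, using the built-in `small_table` dataset (the specific steps and columns chosen here are only for illustration):

```python
import pointblank as pb

# (1) create the Validate object, (2) add validation steps, (3) interrogate
validation = (
    pb.Validate(data=pb.load_dataset(dataset="small_table"))
    .col_vals_not_null(columns="a")
    .col_vals_gt(columns="d", value=0)
    .interrogate()
)

validation.get_tabular_report()  # the validation report table (or just print `validation`)
validation.f_failed()            # per-step metric: fraction of failing test units
validation.get_sundered_data()   # rows that passed every column-value validation step
```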
interrogate(self, collect_extracts: 'bool' = True, collect_tbl_checked: 'bool' = True, get_first_n: 'int | None' = None, sample_n: 'int | None' = None, sample_frac: 'int | float | None' = None, extract_limit: 'int' = 500) -> 'Validate' Execute each validation step against the table and store the results. When a validation plan has been set with a series of validation steps, the interrogation process through `interrogate()` should then be invoked. Interrogation will evaluate each validation step against the table and store the results. The interrogation process will collect extracts of failing rows if the `collect_extracts=` option is set to `True` (the default). We can control the number of rows collected using the `get_first_n=`, `sample_n=`, and `sample_frac=` options. The `extract_limit=` option will enforce a hard limit on the number of rows collected when `collect_extracts=True`. After interrogation is complete, the `Validate` object will have gathered information, and we can use methods like [`n_passed()`](`pointblank.Validate.n_passed`), [`f_failed()`](`pointblank.Validate.f_failed`), etc., to understand how the table performed against the validation plan. A visual representation of the validation results can be viewed by printing the `Validate` object; this will display the validation table in an HTML viewing environment. Parameters ---------- collect_extracts An option to collect rows of the input table that didn't pass a particular validation step. The default is `True` and further options (i.e., `get_first_n=`, `sample_*=`) allow for fine control of how these rows are collected. collect_tbl_checked The processed data frames produced by executing the validation steps are collected and stored in the `Validate` object if `collect_tbl_checked=True`. This information is necessary for some methods (e.g., [`get_sundered_data()`](`pointblank.Validate.get_sundered_data`)), but it can potentially make the object grow to a large size. To opt out of attaching this data, set this to `False`. get_first_n If the option to collect rows where test units failed is chosen, there is the option here to collect the first `n` rows. Supply an integer number of rows to extract from the top of the subset table containing non-passing rows (the ordering of data from the original table is retained). sample_n If the option to collect non-passing rows is chosen, this option allows for the sampling of `n` rows. Supply an integer number of rows to sample from the subset table. If `n` happens to be greater than the number of non-passing rows, then all such rows will be returned. sample_frac If the option to collect non-passing rows is chosen, this option allows for the sampling of a fraction of those rows. Provide a number in the range of `0` to `1`. The number of rows to return could be very large; however, the `extract_limit=` option will apply a hard limit to the returned rows. extract_limit A value that limits the possible number of rows returned when extracting non-passing rows. The default is `500` rows. This limit is applied after any sampling or limiting options are applied. If the number of rows to be returned is greater than this limit, then the number of rows returned will be limited to this value. This is useful for preventing the collection of too many rows when the number of non-passing rows is very large. Returns ------- Validate The `Validate` object with the results of the interrogation.
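As a quick sketch of how the extract-collection options can be combined (assuming a `validation` plan like the ones shown throughout this page has already been defined):

```python
# Sample half of the failing rows, but keep at most 100 extracted rows
# (`extract_limit=` is applied after the sampling option)
validation.interrogate(sample_frac=0.5, extract_limit=100)
```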
Examples -------- Let's use a built-in dataset (`"game_revenue"`) to demonstrate some of the options of the interrogation process. A series of validation steps will populate our validation plan. After setting up the plan, the next step is to interrogate the table and see how well it aligns with our expectations. We'll use the `get_first_n=` option so that any extracts of failing rows are limited to the first `n` rows. ```python import pointblank as pb import polars as pl validation = ( pb.Validate(data=pb.load_dataset(dataset="game_revenue")) .col_vals_lt(columns="item_revenue", value=200) .col_vals_gt(columns="item_revenue", value=0) .col_vals_gt(columns="session_duration", value=5) .col_vals_in_set(columns="item_type", set=["iap", "ad"]) .col_vals_regex(columns="player_id", pattern=r"[A-Z]{12}[0-9]{3}") ) validation.interrogate(get_first_n=10) ``` The validation table shows that step 3 (checking for `session_duration` greater than `5`) has 18 failing test units. This means that 18 rows in the table are problematic. We'd like to see the rows that failed this validation step and we can do that with the [`get_data_extracts()`](`pointblank.Validate.get_data_extracts`) method. ```python pb.preview(validation.get_data_extracts(i=3, frame=True)) ``` The [`get_data_extracts()`](`pointblank.Validate.get_data_extracts`) method will return a Polars DataFrame here with the first 10 rows that failed the validation step (we passed that into the [`preview()`](`pointblank.preview`) function for a better display). There are actually 18 rows that failed but we limited the collection of extracts with `get_first_n=10`. set_tbl(self, tbl: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None' = None) -> 'Validate' Set or replace the table associated with the Validate object. This method allows you to replace the table associated with a Validate object with a different (but presumably similar) table. This is useful when you want to apply the same validation plan to multiple tables or when you have a validation workflow defined but want to swap in a different data source. Parameters ---------- tbl The table to replace the existing table with. This can be any supported table type including DataFrame objects, Ibis table objects, CSV file paths, Parquet file paths, GitHub URLs, or database connection strings. The same table type constraints apply as in the `Validate` constructor. tbl_name An optional name to assign to the new input table object. If no value is provided, the existing table name will be retained. label An optional label for the validation plan. If no value is provided, the existing label will be retained. Returns ------- Validate A new `Validate` object with the replacement table. When to Use ----------- The `set_tbl()` method is particularly useful in scenarios where you have: - multiple similar tables that need the same validation checks - a template validation workflow that should be applied to different data sources - YAML-defined validations where you want to override the table specified in the YAML The `set_tbl()` method creates a copy of the validation object with the new table, so the original validation object remains unchanged. This allows you to reuse validation plans across multiple tables without interference. Examples -------- We will first create two similar tables for our future validation plans. 
```python import pointblank as pb import polars as pl # Create two similar tables table_1 = pl.DataFrame({ "x": [1, 2, 3, 4, 5], "y": [5, 4, 3, 2, 1], "z": ["a", "b", "c", "d", "e"] }) table_2 = pl.DataFrame({ "x": [2, 4, 6, 8, 10], "y": [10, 8, 6, 4, 2], "z": ["f", "g", "h", "i", "j"] }) ``` Create a validation plan with the first table. ```python validation_table_1 = ( pb.Validate( data=table_1, tbl_name="Table 1", label="Validation applied to the first table" ) .col_vals_gt(columns="x", value=0) .col_vals_lt(columns="y", value=10) ) ``` Now apply the same validation plan to the second table. ```python validation_table_2 = ( validation_table_1 .set_tbl( tbl=table_2, tbl_name="Table 2", label="Validation applied to the second table" ) ) ``` Here is the interrogation of the first table: ```python validation_table_1.interrogate() ``` And the second table: ```python validation_table_2.interrogate() ``` get_tabular_report(self, title: 'str | None' = ':default:', incl_header: 'bool' = None, incl_footer: 'bool' = None) -> 'GT' Validation report as a GT table. The `get_tabular_report()` method returns a GT table object that represents the validation report. This validation table provides a summary of the validation results, including the validation steps, the number of test units, the number of failing test units, and the fraction of failing test units. The table also includes status indicators for the 'warning', 'error', and 'critical' levels. You could simply display the validation table without the use of the `get_tabular_report()` method. However, the method provides a way to customize the title of the report. In the future this method may provide additional options for customizing the report. Parameters ---------- title Options for customizing the title of the report. The default is the `":default:"` value which produces a generic title. Another option is `":tbl_name:"`, and that presents the name of the table as the title for the report. If no title is wanted, then `":none:"` can be used. Aside from keyword options, text can be provided for the title. This will be interpreted as Markdown text and transformed internally to HTML. Returns ------- GT A GT table object that represents the validation report. Examples -------- Let's create a `Validate` object with a few validation steps and then interrogate the data table to see how it performs against the validation plan. We can then generate a tabular report to get a summary of the results. ```python import pointblank as pb import polars as pl # Create a Polars DataFrame tbl_pl = pl.DataFrame({"x": [1, 2, 3, 4], "y": [4, 5, 6, 7]}) # Validate data using Polars DataFrame validation = ( pb.Validate(data=tbl_pl, tbl_name="tbl_xy", thresholds=(2, 3, 4)) .col_vals_gt(columns="x", value=1) .col_vals_lt(columns="x", value=3) .col_vals_le(columns="y", value=7) .interrogate() ) # Look at the validation table validation ``` The validation table is displayed with a default title ('Validation Report'). We can use the `get_tabular_report()` method to customize the title of the report. For example, we can set the title to the name of the table by using the `title=":tbl_name:"` option. This will use the string provided in the `tbl_name=` argument of the `Validate` object. ```python validation.get_tabular_report(title=":tbl_name:") ``` The title of the report is now set to the name of the table, which is 'tbl_xy'. This can be useful if you have multiple tables and want to keep track of which table the validation report is for. 
Alternatively, you can provide your own title for the report. ```python validation.get_tabular_report(title="Report for Table XY") ``` The title of the report is now set to 'Report for Table XY'. This can be useful if you want to provide a more descriptive title for the report. get_step_report(self, i: 'int', columns_subset: 'str | list[str] | Column | None' = None, header: 'str' = ':default:', limit: 'int | None' = 10) -> 'GT' Get a detailed report for a single validation step. The `get_step_report()` method returns a report of what went well---or what failed spectacularly---for a given validation step. The report includes a summary of the validation step and a detailed breakdown of the interrogation results. The report is presented as a GT table object, which can be displayed in a notebook or exported to an HTML file. :::{.callout-warning} The `get_step_report()` method is still experimental. Please report any issues you encounter in the [Pointblank issue tracker](https://github.com/posit-dev/pointblank/issues). ::: Parameters ---------- i The step number for which to get the report. columns_subset The columns to display in a step report that shows errors in the input table. By default all columns are shown (`None`). If a subset of columns is desired, we can provide a list of column names, a string with a single column name, a `Column` object, or a `ColumnSelector` object. The last two options allow for more flexible column selection using column selector functions. Errors are raised if the column names provided don't match any columns in the table (when provided as a string or list of strings) or if column selector expressions don't resolve to any columns. header Options for customizing the header of the step report. The default is the `":default:"` value which produces a header with a standard title and set of details underneath. Aside from this default, free text can be provided for the header. This will be interpreted as Markdown text and transformed internally to HTML. You can provide one of two templating elements: `{title}` and `{details}`. The default header has the template `"{title}{details}"` so you can easily start from that and modify as you see fit. If you don't want a header at all, you can set `header=None` to remove it entirely. limit The number of rows to display for those validation steps that check values in rows (the `col_vals_*()` validation steps). The default is `10` rows and the limit can be removed entirely by setting `limit=None`. Returns ------- GT A GT table object that represents the detailed report for the validation step. Types of Step Reports --------------------- The `get_step_report()` method produces a report based on the *type* of validation step. 
The following column-value or row-based validation methods will produce a report that shows the rows of the data that failed: - [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) - [`col_vals_ge()`](`pointblank.Validate.col_vals_ge`) - [`col_vals_lt()`](`pointblank.Validate.col_vals_lt`) - [`col_vals_le()`](`pointblank.Validate.col_vals_le`) - [`col_vals_eq()`](`pointblank.Validate.col_vals_eq`) - [`col_vals_ne()`](`pointblank.Validate.col_vals_ne`) - [`col_vals_between()`](`pointblank.Validate.col_vals_between`) - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`) - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`) - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`) - [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`) - [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`) - [`col_vals_null()`](`pointblank.Validate.col_vals_null`) - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`) - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`) - [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`) - [`col_vals_expr()`](`pointblank.Validate.col_vals_expr`) - [`conjointly()`](`pointblank.Validate.conjointly`) - [`prompt()`](`pointblank.Validate.prompt`) - [`rows_complete()`](`pointblank.Validate.rows_complete`) The [`rows_distinct()`](`pointblank.Validate.rows_distinct`) validation step will produce a report that shows duplicate rows (or duplicate values in one or a set of columns as defined in that method's `columns_subset=` parameter). The [`col_schema_match()`](`pointblank.Validate.col_schema_match`) validation step will produce a report that shows the schema of the data table and the schema of the validation step. The report will indicate whether the schemas match or not. Examples -------- Let's create a validation plan with a few validation steps and interrogate the data. With that, we'll have a look at the validation reporting table for the entire collection of steps and what went well or what failed. ```python import pointblank as pb validation = ( pb.Validate( data=pb.load_dataset(dataset="small_table", tbl_type="pandas"), tbl_name="small_table", label="Example for the get_step_report() method", thresholds=(1, 0.20, 0.40) ) .col_vals_lt(columns="d", value=3500) .col_vals_between(columns="c", left=1, right=8) .col_vals_gt(columns="a", value=3) .col_vals_regex(columns="b", pattern=r"[0-9]-[a-z]{3}-[0-9]{3}") .interrogate() ) validation ``` There were four validation steps performed, where the first three steps had failing test units and the last step had no failures. Let's get a detailed report for the first step by using the `get_step_report()` method. ```python validation.get_step_report(i=1) ``` The report for the first step is displayed. The report includes a summary of the validation step and a detailed breakdown of the interrogation results. The report provides details on what the validation step was checking, the extent to which the test units failed, and a table that shows the failing rows of the data with the column of interest highlighted. The second and third steps also had failing test units. Reports for those steps can be viewed by using `get_step_report(i=2)` and `get_step_report(i=3)` respectively. The final step did not have any failing test units. A report for the final step can still be viewed by using `get_step_report(i=4)`. The report will indicate that every test unit passed and a preview of the target table will be provided.
```python validation.get_step_report(i=4) ``` If you'd like to trim down the number of columns shown in the report, you can provide a subset of columns to display. For example, if you only want to see the columns `a`, `b`, and `c`, you can provide those column names as a list. ```python validation.get_step_report(i=1, columns_subset=["a", "b", "c"]) ``` If you'd like to increase or reduce the maximum number of rows shown in the report, you can provide a different value for the `limit` parameter. For example, if you'd like to see only up to 5 rows, you can set `limit=5`. ```python validation.get_step_report(i=3, limit=5) ``` Step 3 actually had 7 failing test units, but only the first 5 rows are shown in the step report because of the `limit=5` parameter. get_json_report(self, use_fields: 'list[str] | None' = None, exclude_fields: 'list[str] | None' = None) -> 'str' Get a report of the validation results as a JSON-formatted string. The `get_json_report()` method provides a machine-readable report of validation results in JSON format. This is particularly useful for programmatic processing, storing validation results, or integrating with other systems. The report includes detailed information about each validation step, such as assertion type, columns validated, threshold values, test results, and more. By default, all available validation information fields are included in the report. However, you can customize the fields to include or exclude using the `use_fields=` and `exclude_fields=` parameters. Parameters ---------- use_fields An optional list of specific fields to include in the report. If provided, only these fields will be included in the JSON output. If `None` (the default), all standard validation report fields are included. Have a look at the *Available Report Fields* section below for a list of fields that can be included in the report. exclude_fields An optional list of fields to exclude from the report. If provided, these fields will be omitted from the JSON output. If `None` (the default), no fields are excluded. This parameter cannot be used together with `use_fields=`. The *Available Report Fields* provides a listing of fields that can be excluded from the report. Returns ------- str A JSON-formatted string representing the validation report, with each validation step as an object in the report array. Available Report Fields ----------------------- The JSON report can include any of the standard validation report fields, including: - `i`: the step number (1-indexed) - `i_o`: the original step index from the validation plan (pre-expansion) - `assertion_type`: the type of validation assertion (e.g., `"col_vals_gt"`, etc.) 
- `column`: the column being validated (or columns used in certain validations) - `values`: the comparison values or parameters used in the validation - `inclusive`: whether the comparison is inclusive (for range-based validations) - `na_pass`: whether `NA`/`Null` values are considered passing (for certain validations) - `pre`: preprocessing function applied before validation - `segments`: data segments to which the validation was applied - `thresholds`: threshold level statement that was used for the validation step - `label`: custom label for the validation step - `brief`: a brief description of the validation step - `active`: whether the validation step is active - `all_passed`: whether all test units passed in the step - `n`: total number of test units - `n_passed`, `n_failed`: number of test units that passed and failed - `f_passed`, `f_failed`: Fraction of test units that passed and failed - `warning`, `error`, `critical`: whether the namesake threshold level was exceeded (is `null` if threshold not set) - `time_processed`: when the validation step was processed (ISO 8601 format) - `proc_duration_s`: the processing duration in seconds Examples -------- Let's create a validation plan with a few validation steps and generate a JSON report of the results: ```python import pointblank as pb import polars as pl # Create a sample DataFrame tbl = pl.DataFrame({ "a": [5, 7, 8, 9], "b": [3, 4, 2, 1] }) # Create and execute a validation plan validation = ( pb.Validate(data=tbl) .col_vals_gt(columns="a", value=6) .col_vals_lt(columns="b", value=4) .interrogate() ) # Get the full JSON report json_report = validation.get_json_report() print(json_report) ``` You can also customize which fields to include: ```python json_report = validation.get_json_report( use_fields=["i", "assertion_type", "column", "n_passed", "n_failed"] ) print(json_report) ``` Or which fields to exclude: ```python json_report = validation.get_json_report( exclude_fields=[ "i_o", "thresholds", "pre", "segments", "values", "na_pass", "inclusive", "label", "brief", "active", "time_processed", "proc_duration_s" ] ) print(json_report) ``` The JSON output can be further processed or analyzed programmatically: ```python import json # Parse the JSON report report_data = json.loads(validation.get_json_report()) # Extract and analyze validation results failing_steps = [step for step in report_data if step["n_failed"] > 0] print(f"Number of failing validation steps: {len(failing_steps)}") ``` See Also -------- - [`get_tabular_report()`](`pointblank.Validate.get_tabular_report`): Get a formatted HTML report as a GT table - [`get_data_extracts()`](`pointblank.Validate.get_data_extracts`): Get rows that failed validation get_sundered_data(self, type='pass') -> 'FrameT' Get the data that passed or failed the validation steps. Validation of the data is one thing but, sometimes, you want to use the best part of the input dataset for something else. The `get_sundered_data()` method works with a `Validate` object that has been interrogated (i.e., the [`interrogate()`](`pointblank.Validate.interrogate`) method was used). We can get either the 'pass' data piece (rows with no failing test units across all column-value based validation functions), or, the 'fail' data piece (rows with at least one failing test unit across the same series of validations). Details ------- There are some caveats to sundering. 
The validation steps considered for this splitting will only involve steps that meet these conditions: - the step is of a certain check type, where test units are cells checked down a column (e.g., the `col_vals_*()` methods) - `active=` is not set to `False` - `pre=` has not been given an expression for modifying the input table So long as these conditions are met, the data will be split into two constituent tables: one with the rows that passed all validation steps and another with the rows that failed at least one validation step. Parameters ---------- type The type of data to return. Options are `"pass"` or `"fail"`, where the former returns a table only containing rows where test units always passed validation steps, and the latter returns a table only containing rows that had test units failing in at least one validation step. Returns ------- FrameT A table containing the data that passed or failed the validation steps. Examples -------- Let's create a `Validate` object with three validation steps and then interrogate the data. ```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [7, 6, 9, 7, 3, 2], "b": [9, 8, 10, 5, 10, 6], "c": ["c", "d", "a", "b", "a", "b"] } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns="a", value=5) .col_vals_in_set(columns="c", set=["a", "b"]) .interrogate() ) validation ``` From the validation table, we can see that the first and second steps each had 4 passing test units. A failing test unit will mark the entire row as failing in the context of the `get_sundered_data()` method. We can use this method to get the rows of data that passed during interrogation. ```python pb.preview(validation.get_sundered_data()) ``` The returned DataFrame contains the rows that passed all validation steps (we passed this object to [`preview()`](`pointblank.preview`) to show it in an HTML view). From the six-row input DataFrame, the first two rows and the last two rows had test units that failed validation. Thus the middle two rows are the only ones that passed all validation steps and that's what we see in the returned DataFrame. get_data_extracts(self, i: 'int | list[int] | None' = None, frame: 'bool' = False) -> 'dict[int, FrameT | None] | FrameT | None' Get the rows that failed for each validation step. After the [`interrogate()`](`pointblank.Validate.interrogate`) method has been called, the `get_data_extracts()` method can be used to extract the rows that failed in each column-value or row-based validation step (e.g., [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`), [`rows_distinct()`](`pointblank.Validate.rows_distinct`), etc.). The method returns a dictionary of tables containing the rows that failed in every validation step. If `frame=True` and `i=` is a scalar, the value is conveniently returned as a table (forgoing the dictionary structure). Parameters ---------- i The validation step number(s) from which the failed rows are obtained. Can be provided as a list of integers or a single integer. If `None`, all steps are included. frame If `True` and `i=` is a scalar, return the value as a DataFrame instead of a dictionary. Returns ------- dict[int, FrameT | None] | FrameT | None A dictionary of tables containing the rows that failed in every compatible validation step. Alternatively, it can be a DataFrame if `frame=True` and `i=` is a scalar.
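Because the default return value is a dictionary keyed by step number, the extracts can be looped over directly. A small sketch (assuming `validation` has already been interrogated; `len()` gives the row count for both Polars and Pandas tables):

```python
# Summarize how many failing rows were extracted at each step
extracts = validation.get_data_extracts()

for step, extract in extracts.items():
    n_rows = 0 if extract is None else len(extract)
    print(f"Step {step}: {n_rows} extracted row(s)")
```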
Compatible Validation Methods for Yielding Extracted Rows --------------------------------------------------------- The following validation methods operate on column values and will have rows extracted when there are failing test units. - [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) - [`col_vals_ge()`](`pointblank.Validate.col_vals_ge`) - [`col_vals_lt()`](`pointblank.Validate.col_vals_lt`) - [`col_vals_le()`](`pointblank.Validate.col_vals_le`) - [`col_vals_eq()`](`pointblank.Validate.col_vals_eq`) - [`col_vals_ne()`](`pointblank.Validate.col_vals_ne`) - [`col_vals_between()`](`pointblank.Validate.col_vals_between`) - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`) - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`) - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`) - [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`) - [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`) - [`col_vals_null()`](`pointblank.Validate.col_vals_null`) - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`) - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`) - [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`) - [`col_vals_expr()`](`pointblank.Validate.col_vals_expr`) - [`conjointly()`](`pointblank.Validate.conjointly`) - [`prompt()`](`pointblank.Validate.prompt`) An extracted row for these validation methods means that a test unit failed for that row in the validation step. These row-based validation methods will also have rows extracted should there be failing rows: - [`rows_distinct()`](`pointblank.Validate.rows_distinct`) - [`rows_complete()`](`pointblank.Validate.rows_complete`) The extracted rows are a subset of the original table and are useful for further analysis or for understanding the nature of the failing test units. Examples -------- Let's perform a series of validation steps on a Polars DataFrame. We'll use [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) in the first step, [`col_vals_lt()`](`pointblank.Validate.col_vals_lt`) in the second step, and [`col_vals_ge()`](`pointblank.Validate.col_vals_ge`) in the third step. The [`interrogate()`](`pointblank.Validate.interrogate`) method executes the validation; then, we can extract the rows that failed for each validation step. ```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [5, 6, 5, 3, 6, 1], "b": [1, 2, 1, 5, 2, 6], "c": [3, 7, 2, 6, 3, 1], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns="a", value=4) .col_vals_lt(columns="c", value=5) .col_vals_ge(columns="b", value=1) .interrogate() ) validation.get_data_extracts() ``` The `get_data_extracts()` method returns a dictionary of tables, where each table contains a subset of rows from the table. These are the rows that failed for each validation step. In the first step, the [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) method was used to check if the values in column `a` were greater than `4`. The extracted table shows the rows where this condition was not met; look at the `a` column: all values are less than `4`. In the second step, the [`col_vals_lt()`](`pointblank.Validate.col_vals_lt`) method was used to check if the values in column `c` were less than `5`. In the extracted two-row table, we see that the values in column `c` are greater than `5`. The third step ([`col_vals_ge()`](`pointblank.Validate.col_vals_ge`)) checked if the values in column `b` were greater than or equal to `1`.
There were no failing test units, so the extracted table is empty (i.e., has columns but no rows). The `i=` argument can be used to narrow down the extraction to one or more steps. For example, to extract the rows that failed in the first step only: ```python validation.get_data_extracts(i=1) ``` Note that the first validation step is indexed at `1` (not `0`). This 1-based indexing is in place here to match the step numbers reported in the validation table. What we get back is still a dictionary, but it only contains one table (the one for the first step). If you want to get the extracted table as a DataFrame, set `frame=True` and provide a scalar value for `i`. For example, to get the extracted table for the second step as a DataFrame: ```python pb.preview(validation.get_data_extracts(i=2, frame=True)) ``` The extracted table is now a DataFrame, which can serve as a more convenient format for further analysis or visualization. We further used the [`preview()`](`pointblank.preview`) function to show the DataFrame in an HTML view. all_passed(self) -> 'bool' Determine if every validation step passed perfectly, with no failing test units. The `all_passed()` method determines if every validation step passed perfectly, with no failing test units. This method is useful for quickly checking if the table passed all validation steps with flying colors. If there's even a single failing test unit in any validation step, this method will return `False`. This validation metric might be overly stringent for some validation plans where failing test units are generally expected (and the strategy is to monitor data quality over time). However, the value of `all_passed()` could be suitable for validation plans designed to ensure that every test unit passes perfectly (e.g., checks for column presence, null-checking tests, etc.). Returns ------- bool `True` if all validation steps had no failing test units, `False` otherwise. Examples -------- In the example below, we'll use a simple Polars DataFrame with three columns (`a`, `b`, and `c`). There will be three validation steps, and the second step will have a failing test unit (the value `10` isn't less than `9`). After interrogation, the `all_passed()` method is used to determine if all validation steps passed perfectly. ```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [1, 2, 9, 5], "b": [5, 6, 10, 3], "c": ["a", "b", "a", "a"], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns="a", value=0) .col_vals_lt(columns="b", value=9) .col_vals_in_set(columns="c", set=["a", "b"]) .interrogate() ) validation.all_passed() ``` The returned value is `False` since the second validation step had a failing test unit. If it weren't for that one failing test unit, the return value would have been `True`. assert_passing(self) -> 'None' Raise an `AssertionError` if not all tests are passing. The `assert_passing()` method will raise an `AssertionError` if a test does not pass. This method simply wraps `all_passed()` for convenient use in test suites. The step number and assertion made are printed in the `AssertionError` message if a failure occurs, ensuring some details are preserved. If the validation has not yet been interrogated, this method will automatically call [`interrogate()`](`pointblank.Validate.interrogate`) with default parameters before checking for passing tests. Raises ------- AssertionError If any validation step has failing test units.
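Because a failing check surfaces as an ordinary `AssertionError`, `assert_passing()` drops straight into a pytest test. A minimal sketch (the table and the column checks here are made up for illustration):

```python
import pointblank as pb
import polars as pl

def test_orders_are_valid():
    # Stand-in table; in practice this would come from your data source
    tbl = pl.DataFrame({"order_id": [1, 2, 3], "amount": [10.5, 20.0, 7.25]})

    (
        pb.Validate(data=tbl)
        .col_vals_not_null(columns="order_id")
        .col_vals_gt(columns="amount", value=0)
        .assert_passing()  # interrogates automatically; raises AssertionError on failure
    )
```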
Examples -------- In the example below, we'll use a simple Polars DataFrame with three columns (`a`, `b`, and `c`). There will be three validation steps, and the second step will have a failing test unit (the value `10` isn't less than `9`). The `assert_passing()` method is used to assert that all validation steps passed perfectly, automatically performing the interrogation if needed. ```python #| error: True import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [1, 2, 9, 5], "b": [5, 6, 10, 3], "c": ["a", "b", "a", "a"], } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns="a", value=0) .col_vals_lt(columns="b", value=9) # this assertion is false .col_vals_in_set(columns="c", set=["a", "b"]) ) # No need to call interrogate() explicitly validation.assert_passing() ``` assert_below_threshold(self, level: 'str' = 'warning', i: 'int | None' = None, message: 'str | None' = None) -> 'None' Raise an `AssertionError` if validation steps exceed a specified threshold level. The `assert_below_threshold()` method checks whether validation steps' failure rates are below a given threshold level (`"warning"`, `"error"`, or `"critical"`). This is particularly useful in automated testing environments where you want to ensure your data quality meets minimum standards before proceeding. If any validation step exceeds the specified threshold level, an `AssertionError` will be raised with details about which steps failed. If the validation has not yet been interrogated, this method will automatically call [`interrogate()`](`pointblank.Validate.interrogate`) with default parameters. Parameters ---------- level The threshold level to check against, which could be any of `"warning"` (the default), `"error"`, or `"critical"`. An `AssertionError` will be raised if any validation step exceeds this level. i Specific validation step number(s) to check. Can be provided as a single integer or a list of integers. If `None` (the default), all steps are checked. message Custom error message to use if the assertion fails. If `None`, a default message will be generated that lists the specific steps that exceeded the threshold. Returns ------- None Raises ------ AssertionError If any specified validation step exceeds the given threshold level. ValueError If an invalid threshold level is provided. Examples -------- Below are some examples of how to use the `assert_below_threshold()` method. First, we'll create a simple Polars DataFrame with two columns (`a` and `b`). ```python import polars as pl tbl = pl.DataFrame({ "a": [7, 4, 9, 7, 12], "b": [9, 8, 10, 5, 10] }) ``` Then a validation plan will be created with thresholds (`warning=0.1`, `error=0.2`, `critical=0.3`). After interrogating, we display the validation report table: ```python import pointblank as pb validation = ( pb.Validate(data=tbl, thresholds=(0.1, 0.2, 0.3)) .col_vals_gt(columns="a", value=5) # 1 failing test unit .col_vals_lt(columns="b", value=10) # 2 failing test units .interrogate() ) validation ``` Using `assert_below_threshold(level="warning")` will raise an `AssertionError` if any step exceeds the 'warning' threshold (as both steps do here). We can check a specific step against the 'critical' threshold using the `i=` parameter: ```python validation.assert_below_threshold(level="critical", i=1) # Won't raise an error ``` As the first step is below the 'critical' threshold (it exceeds the 'warning' and 'error' thresholds), no error is raised and nothing is printed.
We can also provide a custom error message with the `message=` parameter. Let's try that here: ```python try: validation.assert_below_threshold( level="error", message="Data quality too low for processing!" ) except AssertionError as e: print(f"Custom error: {e}") ``` See Also -------- - [`warning()`](`pointblank.Validate.warning`): get the 'warning' status for each validation step - [`error()`](`pointblank.Validate.error`): get the 'error' status for each validation step - [`critical()`](`pointblank.Validate.critical`): get the 'critical' status for each validation step - [`assert_passing()`](`pointblank.Validate.assert_passing`): assert all validations pass completely above_threshold(self, level: 'str' = 'warning', i: 'int | None' = None) -> 'bool' Check if any validation steps exceed a specified threshold level. The `above_threshold()` method checks whether validation steps exceed a given threshold level. This provides a non-exception-based alternative to [`assert_below_threshold()`](`pointblank.Validate.assert_below_threshold`) for conditional workflow control based on validation results. This method is useful in scenarios where you want to check if any validation steps failed beyond a certain threshold without raising an exception, allowing for more flexible programmatic responses to validation issues. Parameters ---------- level The threshold level to check against. Valid options are: `"warning"` (the least severe threshold level), `"error"` (the middle severity threshold level), and `"critical"` (the most severe threshold level). The default is `"warning"`. i Specific validation step number(s) to check. If a single integer, checks only that step. If a list of integers, checks all specified steps. If `None` (the default), checks all validation steps. Step numbers are 1-based (first step is `1`, not `0`). Returns ------- bool `True` if any of the specified validation steps exceed the given threshold level, `False` otherwise. Raises ------ ValueError If an invalid threshold level is provided. Examples -------- Below are some examples of how to use the `above_threshold()` method. First, we'll create a simple Polars DataFrame with a single column (`values`). Then a validation plan will be created with thresholds (`warning=0.1`, `error=0.2`, `critical=0.3`). After interrogating, we display the validation report table: ```python import pointblank as pb import polars as pl # A single-column table (example values chosen for illustration) tbl = pl.DataFrame({"values": [1, 2, 3, 4, 8, 12]}) validation = ( pb.Validate(data=tbl, thresholds=(0.1, 0.2, 0.3)) .col_vals_gt(columns="values", value=0) .col_vals_lt(columns="values", value=10) .col_vals_between(columns="values", left=0, right=5) .interrogate() ) validation ``` Let's check if any steps exceed the 'warning' threshold with the `above_threshold()` method. A message will be printed if that's the case: ```python if validation.above_threshold(level="warning"): print("Some steps have exceeded the warning threshold") ``` Restrict the check to steps 2 and 3 and test against the 'error' threshold through use of the `i=` argument: ```python if validation.above_threshold(level="error", i=[2, 3]): print("Steps 2 and/or 3 have exceeded the error threshold") ``` You can use this in a workflow to conditionally trigger processes.
Here's a snippet of how you might use this in a function: ```python def process_data(validation_obj): # Only continue processing if validation passes critical thresholds if not validation_obj.above_threshold(level="critical"): # Continue with processing print("Data meets critical quality thresholds, proceeding...") return True else: # Log failure and stop processing print("Data fails critical quality checks, aborting...") return False ``` Note that this is just a suggestion for how to implement conditional workflow processes. You should adapt this pattern to your specific requirements, which might include different threshold levels, custom logging mechanisms, or integration with your organization's data pipelines and notification systems. See Also -------- - [`assert_below_threshold()`](`pointblank.Validate.assert_below_threshold`): a similar method that raises an exception if thresholds are exceeded - [`warning()`](`pointblank.Validate.warning`): get the 'warning' status for each validation step - [`error()`](`pointblank.Validate.error`): get the 'error' status for each validation step - [`critical()`](`pointblank.Validate.critical`): get the 'critical' status for each validation step n(self, i: 'int | list[int] | None' = None, scalar: 'bool' = False) -> 'dict[int, int] | int' Provides a dictionary of the number of test units for each validation step. The `n()` method provides the number of test units for each validation step. This is the total number of test units that were evaluated in the validation step. It is always an integer value. Test units are the atomic units of the validation process. Different validations can have different numbers of test units. For example, a validation that checks for the presence of a column in a table will have a single test unit. A validation that checks for the presence of a value in a column will have as many test units as there are rows in the table. The method provides a dictionary of the number of test units for each validation step. If the `scalar=True` argument is provided and `i=` is a scalar, the value is returned as a scalar instead of a dictionary. The total number of test units for a validation step is the sum of the number of passing and failing test units (i.e., `n = n_passed + n_failed`). Parameters ---------- i The validation step number(s) from which the number of test units is obtained. Can be provided as a list of integers or a single integer. If `None`, all steps are included. scalar If `True` and `i=` is a scalar, return the value as a scalar instead of a dictionary. Returns ------- dict[int, int] | int A dictionary of the number of test units for each validation step or a scalar value. Examples -------- Different types of validation steps can have different numbers of test units. In the example below, we'll use a simple Polars DataFrame with three columns (`a`, `b`, and `c`). There will be three validation steps, and the number of test units for each step will be a little bit different. ```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [1, 2, 9, 5], "b": [5, 6, 10, 3], "c": ["a", "b", "a", "a"], } ) # Define a preprocessing function def filter_by_a_gt_1(df): return df.filter(pl.col("a") > 1) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns="a", value=0) .col_exists(columns="b") .col_vals_lt(columns="b", value=9, pre=filter_by_a_gt_1) .interrogate() ) ``` The first validation step checks that all values in column `a` are greater than `0`. 
Let's use the `n()` method to determine the number of test units for this validation step. ```python validation.n(i=1, scalar=True) ``` The returned value of `4` is the number of test units for the first validation step. This value is the same as the number of rows in the table. The second validation step checks for the existence of column `b`. Using the `n()` method we can get the number of test units for the second step. ```python validation.n(i=2, scalar=True) ``` There's a single test unit here because the validation step is checking for the presence of a single column. The third validation step checks that all values in column `b` are less than `9` after filtering the table to only include rows where the value in column `a` is greater than `1`. Because the table is filtered, the number of test units will be less than the total number of rows in the input table. Let's prove this by using the `n()` method. ```python validation.n(i=3, scalar=True) ``` The returned value of `3` is the number of test units for the third validation step. When using the `pre=` argument, the input table can be mutated before performing the validation. The `n()` method is a good way to determine whether the mutation performed as expected. In all of these examples, the `scalar=True` argument was used to return the value as a scalar integer value. If `scalar=False`, the method will return a dictionary with an entry for the validation step number (from the `i=` argument) and the number of test units. Furthermore, leaving out the `i=` argument altogether will return a dictionary filled with the number of test units for each validation step. Here's what that looks like: ```python validation.n() ``` n_passed(self, i: 'int | list[int] | None' = None, scalar: 'bool' = False) -> 'dict[int, int] | int' Provides a dictionary of the number of test units that passed for each validation step. The `n_passed()` method provides the number of test units that passed for each validation step. This is the number of test units that passed in the validation step. It is always some integer value between `0` and the total number of test units. Test units are the atomic units of the validation process. Different validations can have different numbers of test units. For example, a validation that checks for the presence of a column in a table will have a single test unit. A validation that checks for the presence of a value in a column will have as many test units as there are rows in the table. The method provides a dictionary of the number of passing test units for each validation step. If the `scalar=True` argument is provided and `i=` is a scalar, the value is returned as a scalar instead of a dictionary. Furthermore, a value obtained here will be the complement to the analogous value returned by the [`n_failed()`](`pointblank.Validate.n_failed`) method (i.e., `n - n_failed`). Parameters ---------- i The validation step number(s) from which the number of passing test units is obtained. Can be provided as a list of integers or a single integer. If `None`, all steps are included. scalar If `True` and `i=` is a scalar, return the value as a scalar instead of a dictionary. Returns ------- dict[int, int] | int A dictionary of the number of passing test units for each validation step or a scalar value. Examples -------- In the example below, we'll use a simple Polars DataFrame with three columns (`a`, `b`, and `c`). There will be three validation steps and, as it turns out, all of them will have failing test units.
After interrogation, the `n_passed()` method is used to determine the number of passing test units for each validation step. ```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [7, 4, 9, 7, 12], "b": [9, 8, 10, 5, 10], "c": ["a", "b", "c", "a", "b"] } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns="a", value=5) .col_vals_gt(columns="b", value=pb.col("a")) .col_vals_in_set(columns="c", set=["a", "b"]) .interrogate() ) validation.n_passed() ``` The returned dictionary shows that none of the validation steps had all test units passing (each value is less than `5`, which is the total number of test units for each step). If we wanted to check the number of passing test units for a single validation step, we can provide the step number. Also, we could forego the dictionary and get a scalar value by setting `scalar=True` (ensuring that `i=` is a scalar). ```python validation.n_passed(i=1) ``` The returned value of `4` is the number of passing test units for the first validation step. n_failed(self, i: 'int | list[int] | None' = None, scalar: 'bool' = False) -> 'dict[int, int] | int' Provides a dictionary of the number of test units that failed for each validation step. The `n_failed()` method provides the number of test units that failed for each validation step. This is the number of test units that did not pass in the validation step. It is always some integer value between `0` and the total number of test units. Test units are the atomic units of the validation process. Different validations can have different numbers of test units. For example, a validation that checks for the presence of a column in a table will have a single test unit. A validation that checks for the presence of a value in a column will have as many test units as there are rows in the table. The method provides a dictionary of the number of failing test units for each validation step. If the `scalar=True` argument is provided and `i=` is a scalar, the value is returned as a scalar instead of a dictionary. Furthermore, a value obtained here will be the complement to the analogous value returned by the [`n_passed()`](`pointblank.Validate.n_passed`) method (i.e., `n - n_passed`). Parameters ---------- i The validation step number(s) from which the number of failing test units is obtained. Can be provided as a list of integers or a single integer. If `None`, all steps are included. scalar If `True` and `i=` is a scalar, return the value as a scalar instead of a dictionary. Returns ------- dict[int, int] | int A dictionary of the number of failing test units for each validation step or a scalar value. Examples -------- In the example below, we'll use a simple Polars DataFrame with three columns (`a`, `b`, and `c`). There will be three validation steps and, as it turns out, all of them will have failing test units. After interrogation, the `n_failed()` method is used to determine the number of failing test units for each validation step. ```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [7, 4, 9, 7, 12], "b": [9, 8, 10, 5, 10], "c": ["a", "b", "c", "a", "b"] } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns="a", value=5) .col_vals_gt(columns="b", value=pb.col("a")) .col_vals_in_set(columns="c", set=["a", "b"]) .interrogate() ) validation.n_failed() ``` The returned dictionary shows that all validation steps had failing test units. If we wanted to check the number of failing test units for a single validation step, we can provide the step number.
Also, we could forego the dictionary and get a scalar value by setting `scalar=True` (ensuring that `i=` is a scalar). ```python validation.n_failed(i=1) ``` The returned value of `1` is the number of failing test units for the first validation step. f_passed(self, i: 'int | list[int] | None' = None, scalar: 'bool' = False) -> 'dict[int, float] | float' Provides a dictionary of the fraction of test units that passed for each validation step. A measure of the fraction of test units that passed is provided by the `f_passed` attribute. This is the fraction of test units that passed the validation step over the total number of test units. Given this is a fractional value, it will always be in the range of `0` to `1`. Test units are the atomic units of the validation process. Different validations can have different numbers of test units. For example, a validation that checks for the presence of a column in a table will have a single test unit. A validation that checks for the presence of a value in a column will have as many test units as there are rows in the table. This method provides a dictionary of the fraction of passing test units for each validation step. If the `scalar=True` argument is provided and `i=` is a scalar, the value is returned as a scalar instead of a dictionary. Furthermore, a value obtained here will be the complement to the analogous value returned by the [`f_failed()`](`pointblank.Validate.f_failed`) method (i.e., `1 - f_failed()`). Parameters ---------- i The validation step number(s) from which the fraction of passing test units is obtained. Can be provided as a list of integers or a single integer. If `None`, all steps are included. scalar If `True` and `i=` is a scalar, return the value as a scalar instead of a dictionary. Returns ------- dict[int, float] | float A dictionary of the fraction of passing test units for each validation step or a scalar value. Examples -------- In the example below, we'll use a simple Polars DataFrame with three columns (`a`, `b`, and `c`). There will be three validation steps, all having some failing test units. After interrogation, the `f_passed()` method is used to determine the fraction of passing test units for each validation step. ```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [7, 4, 9, 7, 12, 3, 10], "b": [9, 8, 10, 5, 10, 6, 2], "c": ["a", "b", "c", "a", "b", "d", "c"] } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns="a", value=5) .col_vals_gt(columns="b", value=pb.col("a")) .col_vals_in_set(columns="c", set=["a", "b"]) .interrogate() ) validation.f_passed() ``` The returned dictionary shows the fraction of passing test units for each validation step. The values are all less than `1` since there were failing test units in each step. If we wanted to check the fraction of passing test units for a single validation step, we can provide the step number. Also, we could have the value returned as a scalar by setting `scalar=True` (ensuring that `i=` is a scalar). ```python validation.f_passed(i=1) ``` The returned value is the proportion of passing test units for the first validation step (5 passing test units out of 7 total test units). f_failed(self, i: 'int | list[int] | None' = None, scalar: 'bool' = False) -> 'dict[int, float] | float' Provides a dictionary of the fraction of test units that failed for each validation step. A measure of the fraction of test units that failed is provided by the `f_failed` attribute. 
This is the fraction of test units that failed the validation step over the total number of test units. Given this is a fractional value, it will always be in the range of `0` to `1`. Test units are the atomic units of the validation process. Different validations can have different numbers of test units. For example, a validation that checks for the presence of a column in a table will have a single test unit. A validation that checks for the presence of a value in a column will have as many test units as there are rows in the table. This method provides a dictionary of the fraction of failing test units for each validation step. If the `scalar=True` argument is provided and `i=` is a scalar, the value is returned as a scalar instead of a dictionary. Furthermore, a value obtained here will be the complement to the analogous value returned by the [`f_passed()`](`pointblank.Validate.f_passed`) method (i.e., `1 - f_passed()`). Parameters ---------- i The validation step number(s) from which the fraction of failing test units is obtained. Can be provided as a list of integers or a single integer. If `None`, all steps are included. scalar If `True` and `i=` is a scalar, return the value as a scalar instead of a dictionary. Returns ------- dict[int, float] | float A dictionary of the fraction of failing test units for each validation step or a scalar value. Examples -------- In the example below, we'll use a simple Polars DataFrame with three columns (`a`, `b`, and `c`). There will be three validation steps, all having some failing test units. After interrogation, the `f_failed()` method is used to determine the fraction of failing test units for each validation step. ```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [7, 4, 9, 7, 12, 3, 10], "b": [9, 8, 10, 5, 10, 6, 2], "c": ["a", "b", "c", "a", "b", "d", "c"] } ) validation = ( pb.Validate(data=tbl) .col_vals_gt(columns="a", value=5) .col_vals_gt(columns="b", value=pb.col("a")) .col_vals_in_set(columns="c", set=["a", "b"]) .interrogate() ) validation.f_failed() ``` The returned dictionary shows the fraction of failing test units for each validation step. The values are all greater than `0` since there were failing test units in each step. If we wanted to check the fraction of failing test units for a single validation step, we can provide the step number. Also, we could have the value returned as a scalar by setting `scalar=True` (ensuring that `i=` is a scalar). ```python validation.f_failed(i=1) ``` The returned value is the proportion of failing test units for the first validation step (2 failing test units out of 7 total test units). warning(self, i: 'int | list[int] | None' = None, scalar: 'bool' = False) -> 'dict[int, bool] | bool' Get the 'warning' level status for each validation step. The 'warning' status for a validation step is `True` if the fraction of failing test units meets or exceeds the threshold for the 'warning' level. Otherwise, the status is `False`. The ascribed name of 'warning' is semantic and does not imply that a warning message is generated, it is simply a status indicator that could be used to trigger some action to be taken. 
Here's how it fits in with other status indicators: - 'warning': the status obtained by calling 'warning()', least severe - 'error': the status obtained by calling [`error()`](`pointblank.Validate.error`), middle severity - 'critical': the status obtained by calling [`critical()`](`pointblank.Validate.critical`), most severe This method provides a dictionary of the 'warning' status for each validation step. If the `scalar=True` argument is provided and `i=` is a scalar, the value is returned as a scalar instead of a dictionary. Parameters ---------- i The validation step number(s) from which the 'warning' status is obtained. Can be provided as a list of integers or a single integer. If `None`, all steps are included. scalar If `True` and `i=` is a scalar, return the value as a scalar instead of a dictionary. Returns ------- dict[int, bool] | bool A dictionary of the 'warning' status for each validation step or a scalar value. Examples -------- In the example below, we'll use a simple Polars DataFrame with three columns (`a`, `b`, and `c`). There will be three validation steps, and the first step will have some failing test units, the rest will be completely passing. We've set thresholds here for each of the steps by using `thresholds=(2, 4, 5)`, which means: - the 'warning' threshold is `2` failing test units - the 'error' threshold is `4` failing test units - the 'critical' threshold is `5` failing test units After interrogation, the `warning()` method is used to determine the 'warning' status for each validation step. ```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [7, 4, 9, 7, 12, 3, 10], "b": [9, 8, 10, 5, 10, 6, 2], "c": ["a", "b", "a", "a", "b", "b", "a"] } ) validation = ( pb.Validate(data=tbl, thresholds=(2, 4, 5)) .col_vals_gt(columns="a", value=5) .col_vals_lt(columns="b", value=15) .col_vals_in_set(columns="c", set=["a", "b"]) .interrogate() ) validation.warning() ``` The returned dictionary provides the 'warning' status for each validation step. The first step has a `True` value since the number of failing test units meets the threshold for the 'warning' level. The second and third steps have `False` values since the number of failing test units was `0`, which is below the threshold for the 'warning' level. We can also visually inspect the 'warning' status across all steps by viewing the validation table: ```python validation ``` We can see that there's a filled gray circle in the first step (look to the far right side, in the `W` column) indicating that the 'warning' threshold was met. The other steps have empty gray circles. This means that thresholds were 'set but not met' in those steps. If we wanted to check the 'warning' status for a single validation step, we can provide the step number. Also, we could have the value returned as a scalar by setting `scalar=True` (ensuring that `i=` is a scalar). ```python validation.warning(i=1) ``` The returned value is `True`, indicating that the first validation step had met the 'warning' threshold. error(self, i: 'int | list[int] | None' = None, scalar: 'bool' = False) -> 'dict[int, bool] | bool' Get the 'error' level status for each validation step. The 'error' status for a validation step is `True` if the fraction of failing test units meets or exceeds the threshold for the 'error' level. Otherwise, the status is `False`. 
The ascribed name of 'error' is semantic and does not imply that the validation process is halted, it is simply a status indicator that could be used to trigger some action to be taken. Here's how it fits in with other status indicators: - 'warning': the status obtained by calling [`warning()`](`pointblank.Validate.warning`), least severe - 'error': the status obtained by calling `error()`, middle severity - 'critical': the status obtained by calling [`critical()`](`pointblank.Validate.critical`), most severe This method provides a dictionary of the 'error' status for each validation step. If the `scalar=True` argument is provided and `i=` is a scalar, the value is returned as a scalar instead of a dictionary. Parameters ---------- i The validation step number(s) from which the 'error' status is obtained. Can be provided as a list of integers or a single integer. If `None`, all steps are included. scalar If `True` and `i=` is a scalar, return the value as a scalar instead of a dictionary. Returns ------- dict[int, bool] | bool A dictionary of the 'error' status for each validation step or a scalar value. Examples -------- In the example below, we'll use a simple Polars DataFrame with three columns (`a`, `b`, and `c`). There will be three validation steps, and the first step will have some failing test units, the rest will be completely passing. We've set thresholds here for each of the steps by using `thresholds=(2, 4, 5)`, which means: - the 'warning' threshold is `2` failing test units - the 'error' threshold is `4` failing test units - the 'critical' threshold is `5` failing test units After interrogation, the `error()` method is used to determine the 'error' status for each validation step. ```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": [3, 4, 9, 7, 2, 3, 8], "b": [9, 8, 10, 5, 10, 6, 2], "c": ["a", "b", "a", "a", "b", "b", "a"] } ) validation = ( pb.Validate(data=tbl, thresholds=(2, 4, 5)) .col_vals_gt(columns="a", value=5) .col_vals_lt(columns="b", value=15) .col_vals_in_set(columns="c", set=["a", "b"]) .interrogate() ) validation.error() ``` The returned dictionary provides the 'error' status for each validation step. The first step has a `True` value since the number of failing test units meets the threshold for the 'error' level. The second and third steps have `False` values since the number of failing test units was `0`, which is below the threshold for the 'error' level. We can also visually inspect the 'error' status across all steps by viewing the validation table: ```python validation ``` We can see that there are filled gray and yellow circles in the first step (far right side, in the `W` and `E` columns) indicating that the 'warning' and 'error' thresholds were met. The other steps have empty gray and yellow circles. This means that thresholds were 'set but not met' in those steps. If we wanted to check the 'error' status for a single validation step, we can provide the step number. Also, we could have the value returned as a scalar by setting `scalar=True` (ensuring that `i=` is a scalar). ```python validation.error(i=1) ``` The returned value is `True`, indicating that the first validation step had the 'error' threshold met. critical(self, i: 'int | list[int] | None' = None, scalar: 'bool' = False) -> 'dict[int, bool] | bool' Get the 'critical' level status for each validation step. The 'critical' status for a validation step is `True` if the fraction of failing test units meets or exceeds the threshold for the 'critical' level. 
Otherwise, the status is `False`. The ascribed name of 'critical' is semantic and is thus simply a status indicator that could be used to trigger some action to be taken. Here's how it fits in with other status indicators:

- 'warning': the status obtained by calling [`warning()`](`pointblank.Validate.warning`), least severe
- 'error': the status obtained by calling [`error()`](`pointblank.Validate.error`), middle severity
- 'critical': the status obtained by calling `critical()`, most severe

This method provides a dictionary of the 'critical' status for each validation step. If the `scalar=True` argument is provided and `i=` is a scalar, the value is returned as a scalar instead of a dictionary.

Parameters
----------
i
    The validation step number(s) from which the 'critical' status is obtained. Can be provided as a list of integers or a single integer. If `None`, all steps are included.
scalar
    If `True` and `i=` is a scalar, return the value as a scalar instead of a dictionary.

Returns
-------
dict[int, bool] | bool
    A dictionary of the 'critical' status for each validation step or a scalar value.

Examples
--------
In the example below, we'll use a simple Polars DataFrame with three columns (`a`, `b`, and `c`). There will be three validation steps, and the first step will have many failing test units, while the rest will be completely passing. We've set thresholds here for each of the steps by using `thresholds=(2, 4, 5)`, which means:

- the 'warning' threshold is `2` failing test units
- the 'error' threshold is `4` failing test units
- the 'critical' threshold is `5` failing test units

After interrogation, the `critical()` method is used to determine the 'critical' status for each validation step.

```python
import pointblank as pb
import polars as pl

tbl = pl.DataFrame(
    {
        "a": [2, 4, 4, 7, 2, 3, 8],
        "b": [9, 8, 10, 5, 10, 6, 2],
        "c": ["a", "b", "a", "a", "b", "b", "a"]
    }
)

validation = (
    pb.Validate(data=tbl, thresholds=(2, 4, 5))
    .col_vals_gt(columns="a", value=5)
    .col_vals_lt(columns="b", value=15)
    .col_vals_in_set(columns="c", set=["a", "b"])
    .interrogate()
)

validation.critical()
```

The returned dictionary provides the 'critical' status for each validation step. The first step has a `True` value since the number of failing test units meets the threshold for the 'critical' level. The second and third steps have `False` values since the number of failing test units was `0`, which is below the threshold for the 'critical' level. We can also visually inspect the 'critical' status across all steps by viewing the validation table:

```python
validation
```

We can see that there are filled gray, yellow, and red circles in the first step (far right side, in the `W`, `E`, and `C` columns) indicating that the 'warning', 'error', and 'critical' thresholds were met. The other steps have empty gray, yellow, and red circles. This means that thresholds were 'set but not met' in those steps. If we wanted to check the 'critical' status for a single validation step, we can provide the step number. Also, we could have the value returned as a scalar by setting `scalar=True` (ensuring that `i=` is a scalar).

```python
validation.critical(i=1)
```

The returned value is `True`, indicating that the first validation step met the 'critical' threshold.

## The Inspection and Assistance family

The *Inspection and Assistance* group contains functions that are helpful for getting to grips with a new data table.
Use the `DataScan` class to get a quick overview of the data, `preview()` to see the first and last few rows of a table, `col_summary_tbl()` for a column-level summary of a table, `missing_vals_tbl()` to see where there are missing values in a table, and `get_column_count()`/`get_row_count()` to get the number of columns and rows in a table. Several datasets included in the package can be accessed via the `load_dataset()` function. Finally, the `config()` utility lets us set global configuration parameters. Want to chat with an assistant? Use the `assistant()` function to get help with Pointblank.

DataScan(data: 'IntoFrameT', tbl_name: 'str | None' = None) -> 'None'

Get a summary of a dataset.

The `DataScan` class provides a way to get a summary of a dataset. The summary includes the following information:

- the name of the table (if provided)
- the type of the table (e.g., `"polars"`, `"pandas"`, etc.)
- the number of rows and columns in the table
- column-level information, including:
    - the column name
    - the column type
    - measures of missingness and distinctness
    - measures of negative, zero, and positive values (for numerical columns)
    - a sample of the data (the first 5 values)
    - statistics (if the column contains numbers, strings, or datetimes)

To obtain a dictionary representation of the summary, you can use the `to_dict()` method. To get a JSON representation of the summary, you can use the `to_json()` method. To save the JSON text to a file, the `save_to_json()` method could be used.

:::{.callout-warning}
The `DataScan()` class is still experimental. Please report any issues you encounter in the [Pointblank issue tracker](https://github.com/posit-dev/pointblank/issues).
:::

Parameters
----------
data
    The data to scan and summarize. This could be a DataFrame object, an Ibis table object, a CSV file path, a Parquet file path, a GitHub URL pointing to a CSV or Parquet file, or a database connection string.
tbl_name
    Optionally, the name of the table could be provided as `tbl_name`.

Measures of Missingness and Distinctness
----------------------------------------
For each column, the following measures are provided:

- `n_missing_values`: the number of missing values in the column
- `f_missing_values`: the fraction of missing values in the column
- `n_unique_values`: the number of unique values in the column
- `f_unique_values`: the fraction of unique values in the column

The fractions are calculated as the ratio of the measure to the total number of rows in the dataset.

Counts and Fractions of Negative, Zero, and Positive Values
-----------------------------------------------------------
For numerical columns, the following measures are provided:

- `n_negative_values`: the number of negative values in the column
- `f_negative_values`: the fraction of negative values in the column
- `n_zero_values`: the number of zero values in the column
- `f_zero_values`: the fraction of zero values in the column
- `n_positive_values`: the number of positive values in the column
- `f_positive_values`: the fraction of positive values in the column

The fractions are calculated as the ratio of the measure to the total number of rows in the dataset.

Statistics for Numerical and String Columns
-------------------------------------------
For numerical and string columns, several statistical measures are provided. Please note that for string columns, the statistics are based on the lengths of the strings in the column.
The following descriptive statistics are provided: - `mean`: the mean of the column - `std_dev`: the standard deviation of the column Additionally, the following quantiles are provided: - `min`: the minimum value in the column - `p05`: the 5th percentile of the column - `q_1`: the first quartile of the column - `med`: the median of the column - `q_3`: the third quartile of the column - `p95`: the 95th percentile of the column - `max`: the maximum value in the column - `iqr`: the interquartile range of the column Statistics for Date and Datetime Columns ---------------------------------------- For date/datetime columns, the following statistics are provided: - `min`: the minimum date/datetime in the column - `max`: the maximum date/datetime in the column Returns ------- DataScan A DataScan object. preview(data: 'FrameT | Any', columns_subset: 'str | list[str] | Column | None' = None, n_head: 'int' = 5, n_tail: 'int' = 5, limit: 'int' = 50, show_row_numbers: 'bool' = True, max_col_width: 'int' = 250, min_tbl_width: 'int' = 500, incl_header: 'bool' = None) -> 'GT' Display a table preview that shows some rows from the top, some from the bottom. To get a quick look at the data in a table, we can use the `preview()` function to display a preview of the table. The function shows a subset of the rows from the start and end of the table, with the number of rows from the start and end determined by the `n_head=` and `n_tail=` parameters (set to `5` by default). This function works with any table that is supported by the `pointblank` library, including Pandas, Polars, and Ibis backend tables (e.g., DuckDB, MySQL, PostgreSQL, SQLite, Parquet, etc.). The view is optimized for readability, with column names and data types displayed in a compact format. The column widths are sized to fit the column names, dtypes, and column content up to a configurable maximum width of `max_col_width=` pixels. The table can be scrolled horizontally to view even very large datasets. Since the output is a Great Tables (`GT`) object, it can be further customized using the `great_tables` API. Parameters ---------- data The table to preview, which could be a DataFrame object, an Ibis table object, a CSV file path, a Parquet file path, or a database connection string. When providing a CSV or Parquet file path (as a string or `pathlib.Path` object), the file will be automatically loaded using an available DataFrame library (Polars or Pandas). Parquet input also supports glob patterns, directories containing .parquet files, and Spark-style partitioned datasets. Connection strings enable direct database access via Ibis with optional table specification using the `::table_name` suffix. Read the *Supported Input Table Types* section for details on the supported table types. columns_subset The columns to display in the table, by default `None` (all columns are shown). This can be a string, a list of strings, a `Column` object, or a `ColumnSelector` object. The latter two options allow for more flexible column selection using column selector functions. Errors are raised if the column names provided don't match any columns in the table (when provided as a string or list of strings) or if column selector expressions don't resolve to any columns. n_head The number of rows to show from the start of the table. Set to `5` by default. n_tail The number of rows to show from the end of the table. Set to `5` by default. limit The limit value for the sum of `n_head=` and `n_tail=` (the total number of rows shown). 
If the sum of `n_head=` and `n_tail=` exceeds the limit, an error is raised. The default value is `50`.
show_row_numbers
    Should row numbers be shown? The numbers shown reflect the row numbers of the head and tail in the input `data=` table. By default, this is set to `True`.
max_col_width
    The maximum width of the columns (in pixels) before the text is truncated. The default value is `250` (`"250px"`).
min_tbl_width
    The minimum width of the table in pixels. If the sum of the column widths is less than this value, all columns are sized up to reach this minimum width value. The default value is `500` (`"500px"`).
incl_header
    Should the table include a header with the table type and table dimensions? Set to `True` by default.

Returns
-------
GT
    A GT object that displays the preview of the table.

Supported Input Table Types
---------------------------
The `data=` parameter can be given any of the following table types:

- Polars DataFrame (`"polars"`)
- Pandas DataFrame (`"pandas"`)
- PySpark table (`"pyspark"`)
- DuckDB table (`"duckdb"`)*
- MySQL table (`"mysql"`)*
- PostgreSQL table (`"postgresql"`)*
- SQLite table (`"sqlite"`)*
- Microsoft SQL Server table (`"mssql"`)*
- Snowflake table (`"snowflake"`)*
- Databricks table (`"databricks"`)*
- BigQuery table (`"bigquery"`)*
- Parquet table (`"parquet"`)*
- CSV files (string path or `pathlib.Path` object with `.csv` extension)
- Parquet files (string path, `pathlib.Path` object, glob pattern, directory with `.parquet` extension, or partitioned dataset)
- Database connection strings (URI format with optional table specification)

The table types marked with an asterisk need to be prepared as Ibis tables (with type of `ibis.expr.types.relations.Table`). Furthermore, using `preview()` with these types of tables requires the Ibis library (`v9.5.0` or above) to be installed. If the input table is a Polars or Pandas DataFrame, the availability of Ibis is not needed.

To use a CSV file, ensure that a string or `pathlib.Path` object with a `.csv` extension is provided. The file will be automatically detected and loaded using the best available DataFrame library. The loading preference is Polars first, then Pandas as a fallback.

Connection strings follow database URL formats and must also specify a table using the `::table_name` suffix. Examples include:

```
"duckdb:///path/to/database.ddb::table_name"
"sqlite:///path/to/database.db::table_name"
"postgresql://user:password@localhost:5432/database::table_name"
"mysql://user:password@localhost:3306/database::table_name"
"bigquery://project/dataset::table_name"
"snowflake://user:password@account/database/schema::table_name"
```

When using connection strings, the Ibis library with the appropriate backend driver is required.

Examples
--------
It's easy to preview a table using the `preview()` function. Here's an example using the `small_table` dataset (itself loaded using the [`load_dataset()`](`pointblank.load_dataset`) function). That dataset is a Polars DataFrame, but the `preview()` function works with any table supported by `pointblank`, including Pandas DataFrames and Ibis backend tables. Here's an example using a DuckDB table handled by Ibis:

```python
small_table_duckdb = pb.load_dataset("small_table", tbl_type="duckdb")

pb.preview(small_table_duckdb)
```

The blue dividing line marks the end of the first `n_head=` rows and the start of the last `n_tail=` rows. We can adjust the number of rows shown from the start and end of the table by setting the `n_head=` and `n_tail=` parameters.
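To have a concrete table on hand for the next adjustment, here's a minimal setup (assuming the default Polars flavor of the bundled `small_table` dataset) that loads the data into the `small_table_polars` object used in the example that follows:

```python
import pointblank as pb

# Load the small_table dataset as a Polars DataFrame and show the default preview
small_table_polars = pb.load_dataset(dataset="small_table", tbl_type="polars")

pb.preview(small_table_polars)
```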
Let's enlarge each of these to `10`:

```python
pb.preview(small_table_polars, n_head=10, n_tail=10)
```

In the above case, the entire dataset is shown since the sum of `n_head=` and `n_tail=` is greater than the number of rows in the table (which is 13).

The `columns_subset=` parameter can be used to show only specific columns in the table. You can provide a list of column names to make the selection. Let's try that with the `"game_revenue"` dataset as a Pandas DataFrame:

```python
game_revenue_pandas = pb.load_dataset("game_revenue", tbl_type="pandas")

pb.preview(game_revenue_pandas, columns_subset=["player_id", "item_name", "item_revenue"])
```

Alternatively, we can use column selector functions like [`starts_with()`](`pointblank.starts_with`) and [`matches()`](`pointblank.matches`) to select columns based on text or patterns:

```python
pb.preview(game_revenue_pandas, n_head=2, n_tail=2, columns_subset=pb.starts_with("session"))
```

Multiple column selector functions can be combined within [`col()`](`pointblank.col`) using operators like `|` and `&`:

```python
pb.preview(
    game_revenue_pandas,
    n_head=2,
    n_tail=2,
    columns_subset=pb.col(pb.starts_with("item") | pb.matches("player"))
)
```

### Working with CSV Files

The `preview()` function can directly accept CSV file paths, making it easy to preview data stored in CSV files without manual loading. Either a string path or a `pathlib.Path` object can be used to specify the CSV file.

### Working with Parquet Files

The `preview()` function can directly accept Parquet files and datasets in various formats, including glob patterns and directories:

```python
# Multiple Parquet files with glob patterns
pb.preview("data/sales_*.parquet")

# Directory containing Parquet files
pb.preview("parquet_data/")

# Partitioned Parquet dataset
pb.preview("sales_data/")  # Auto-discovers partition columns
```

### Working with Database Connection Strings

The `preview()` function supports database connection strings for direct preview of database tables. Connection strings must specify a table using the `::table_name` suffix. For comprehensive documentation on supported connection string formats, error handling, and installation requirements, see the [`connect_to_table()`](`pointblank.connect_to_table`) function.

col_summary_tbl(data: 'FrameT | Any', tbl_name: 'str | None' = None) -> 'GT'

Generate a column-level summary table of a dataset.

The `col_summary_tbl()` function generates a summary table of a dataset, focusing on providing column-level information about the dataset. The summary includes the following information:

- the type of the table (e.g., `"polars"`, `"pandas"`, etc.)
- the number of rows and columns in the table
- column-level information, including:
    - the column name
    - the column type
    - measures of missingness and distinctness
    - descriptive stats and quantiles
    - statistics for datetime columns

The summary table is returned as a GT object, which can be displayed in a notebook or saved to an HTML file.

:::{.callout-warning}
The `col_summary_tbl()` function is still experimental. Please report any issues you encounter in the [Pointblank issue tracker](https://github.com/posit-dev/pointblank/issues).
:::

Parameters
----------
data
    The table to summarize, which could be a DataFrame object, an Ibis table object, a CSV file path, a Parquet file path, or a database connection string. Read the *Supported Input Table Types* section for details on the supported table types.
tbl_name
    Optionally, the name of the table could be provided as `tbl_name=`.
Returns
-------
GT
    A GT object that displays the column-level summaries of the table.

Supported Input Table Types
---------------------------
The `data=` parameter can be given any of the following table types:

- Polars DataFrame (`"polars"`)
- Pandas DataFrame (`"pandas"`)
- DuckDB table (`"duckdb"`)*
- MySQL table (`"mysql"`)*
- PostgreSQL table (`"postgresql"`)*
- SQLite table (`"sqlite"`)*
- Parquet table (`"parquet"`)*
- CSV files (string path or `pathlib.Path` object with `.csv` extension)
- Parquet files (string path, `pathlib.Path` object, glob pattern, directory with `.parquet` extension, or partitioned dataset)
- GitHub URLs (direct links to CSV or Parquet files on GitHub)
- Database connection strings (URI format with optional table specification)

The table types marked with an asterisk need to be prepared as Ibis tables (with type of `ibis.expr.types.relations.Table`). Furthermore, using `col_summary_tbl()` with these types of tables requires the Ibis library (`v9.5.0` or above) to be installed. If the input table is a Polars or Pandas DataFrame, the availability of Ibis is not needed.

Examples
--------
It's easy to get a column-level summary of a table using the `col_summary_tbl()` function. Here's an example using the `small_table` dataset (itself loaded using the [`load_dataset()`](`pointblank.load_dataset`) function). That table is a Polars DataFrame, but the `col_summary_tbl()` function works with any table supported by `pointblank`, including Pandas DataFrames and Ibis backend tables. Here's an example using a DuckDB table handled by Ibis:

```python
nycflights = pb.load_dataset(dataset="nycflights", tbl_type="duckdb")

pb.col_summary_tbl(data=nycflights, tbl_name="nycflights")
```

missing_vals_tbl(data: 'FrameT | Any') -> 'GT'

Display a table that shows the missing values in the input table.

The `missing_vals_tbl()` function generates a table that shows the missing values in the input table. The table is displayed using the Great Tables API, which allows for further customization of the table's appearance if so desired.

Parameters
----------
data
    The table for which to display the missing values. This could be a DataFrame object, an Ibis table object, a CSV file path, a Parquet file path, or a database connection string. Read the *Supported Input Table Types* section for details on the supported table types.

Returns
-------
GT
    A GT object that displays the table of missing values in the input table.

Supported Input Table Types
---------------------------
The `data=` parameter can be given any of the following table types:

- Polars DataFrame (`"polars"`)
- Pandas DataFrame (`"pandas"`)
- PySpark table (`"pyspark"`)
- DuckDB table (`"duckdb"`)*
- MySQL table (`"mysql"`)*
- PostgreSQL table (`"postgresql"`)*
- SQLite table (`"sqlite"`)*
- Microsoft SQL Server table (`"mssql"`)*
- Snowflake table (`"snowflake"`)*
- Databricks table (`"databricks"`)*
- BigQuery table (`"bigquery"`)*
- Parquet table (`"parquet"`)*
- CSV files (string path or `pathlib.Path` object with `.csv` extension)
- Parquet files (string path, `pathlib.Path` object, glob pattern, directory with `.parquet` extension, or partitioned dataset)
- Database connection strings (URI format with optional table specification)

The table types marked with an asterisk need to be prepared as Ibis tables (with type of `ibis.expr.types.relations.Table`). Furthermore, using `missing_vals_tbl()` with these types of tables requires the Ibis library (`v9.5.0` or above) to be installed.
If the input table is a Polars or Pandas DataFrame, the availability of Ibis is not needed. The Missing Values Table ------------------------ The missing values table shows the proportion of missing values in each column of the input table. The table is divided into sectors, with each sector representing a range of rows in the table. The proportion of missing values in each sector is calculated for each column. The table is displayed using the Great Tables API, which allows for further customization of the table's appearance. To ensure that the table can scale to tables with many columns, each row in the reporting table represents a column in the input table. There are 10 sectors shown in the table, where the first sector represents the first 10% of the rows, the second sector represents the next 10% of the rows, and so on. Any sectors that are light blue indicate that there are no missing values in that sector. If there are missing values, the proportion of missing values is shown by a gray color (light gray for low proportions, dark gray to black for very high proportions). Examples -------- The `missing_vals_tbl()` function is useful for quickly identifying columns with missing values in a table. Here's an example using the `nycflights` dataset (loaded as a Polars DataFrame using the [`load_dataset()`](`pointblank.load_dataset`) function): The table shows the proportion of missing values in each column of the `nycflights` dataset. The table is divided into sectors, with each sector representing a range of rows in the table (with around 34,000 rows per sector). The proportion of missing values in each sector is calculated for each column. The various shades of gray indicate the proportion of missing values in each sector. Many columns have no missing values at all, and those sectors are colored light blue. assistant(model: 'str', data: 'FrameT | Any | None' = None, tbl_name: 'str | None' = None, api_key: 'str | None' = None, display: 'str | None' = None) -> 'None' Chat with the PbA (Pointblank Assistant) about your data validation needs. The `assistant()` function provides an interactive chat session with the PbA (Pointblank Assistant) to help you with your data validation needs. The PbA can help you with constructing validation plans, suggesting validation methods, and providing code snippets for using the Pointblank Python package. Feel free to ask the PbA about any aspect of the Pointblank package and it will do its best to assist you. The PbA can also help you with constructing validation plans for your data tables. If you provide a data table to the PbA, it will internally generate a JSON summary of the table and use that information to suggest validation methods that can be used with the Pointblank package. If using a Polars table as the data source, the PbA will be knowledgeable about the Polars API and can smartly suggest validation steps that use aggregate measures with up-to-date Polars methods. The PbA can be used with models from the following providers: - Anthropic - OpenAI - Ollama - Amazon Bedrock The PbA can be displayed in a browser (the default) or in the terminal. You can choose one or the other by setting the `display=` parameter to `"browser"` or `"terminal"`. :::{.callout-warning} The `assistant()` function is still experimental. Please report any issues you encounter in the [Pointblank issue tracker](https://github.com/posit-dev/pointblank/issues). ::: Parameters ---------- model The model to be used. 
This should be in the form of `provider:model` (e.g., `"anthropic:claude-sonnet-4-5"`). Supported providers are `"anthropic"`, `"openai"`, `"ollama"`, and `"bedrock"`.
data
    An optional data table to focus on during discussion with the PbA, which could be a DataFrame object, an Ibis table object, a CSV file path, a Parquet file path, or a database connection string. Read the *Supported Input Table Types* section for details on the supported table types.
tbl_name : str, optional
    The name of the data table. This is optional and is only used to provide a more detailed prompt to the PbA.
api_key : str, optional
    The API key to be used for the model.
display : str, optional
    The display mode to use for the chat session. Supported values are `"browser"` and `"terminal"`. If not provided, the default value is `"browser"`.

Returns
-------
None
    Nothing is returned. Rather, you get an interactive chat session with the PbA, which is displayed in a browser or in the terminal.

Constructing the `model` Argument
---------------------------------
The `model=` argument should be constructed using the provider and model name separated by a colon (`provider:model`). The provider text can be any of:

- `"anthropic"` (Anthropic)
- `"openai"` (OpenAI)
- `"ollama"` (Ollama)
- `"bedrock"` (Amazon Bedrock)

The model name should be the specific model to be used from the provider. Model names are subject to change so consult the provider's documentation for the most up-to-date model names.

Notes on Authentication
-----------------------
Providing a valid API key as a string in the `api_key` argument is adequate for getting started but you should consider using a more secure method for handling API keys. One way to do this is to load the API key from an environment variable and retrieve it using the `os` module (specifically the `os.getenv()` function). Places to store the API key might include `.bashrc`, `.bash_profile`, `.zshrc`, or `.zsh_profile`.

Another solution is to store one or more model provider API keys in an `.env` file (in the root of your project). If the API keys have correct names (e.g., `ANTHROPIC_API_KEY` or `OPENAI_API_KEY`) then DraftValidation will automatically load the API key from the `.env` file and there's no need to provide the `api_key` argument. An `.env` file might look like this:

```plaintext
ANTHROPIC_API_KEY="your_anthropic_api_key_here"
OPENAI_API_KEY="your_openai_api_key_here"
```

There's no need to have the `python-dotenv` package installed when using `.env` files in this way.

Notes on Data Sent to the Model Provider
----------------------------------------
If `data=` is provided, then what is sent to the model provider is a JSON summary of the table. This data summary is generated internally by use of the `DataScan` class. The summary includes the following information:

- the number of rows and columns in the table
- the type of dataset (e.g., Polars, DuckDB, Pandas, etc.)
- the column names and their types
- column level statistics such as the number of missing values, min, max, mean, and median, etc.
- a short list of data values in each column

The JSON summary is used to provide the model with the necessary information to be knowledgeable about the data table. Compared to the size of the entire table, the JSON summary is quite small and can be safely sent to the model provider.

The Amazon Bedrock provider is a special case since it is a self-hosted model and security controls are in place to ensure that data is kept within the user's AWS environment.
If using an Ollama model, all data is handled locally.

Supported Input Table Types
---------------------------
The `data=` parameter can be given any of the following table types:

- Polars DataFrame (`"polars"`)
- Pandas DataFrame (`"pandas"`)
- PySpark table (`"pyspark"`)
- DuckDB table (`"duckdb"`)*
- MySQL table (`"mysql"`)*
- PostgreSQL table (`"postgresql"`)*
- SQLite table (`"sqlite"`)*
- Microsoft SQL Server table (`"mssql"`)*
- Snowflake table (`"snowflake"`)*
- Databricks table (`"databricks"`)*
- BigQuery table (`"bigquery"`)*
- Parquet table (`"parquet"`)*
- CSV files (string path or `pathlib.Path` object with `.csv` extension)
- Parquet files (string path, `pathlib.Path` object, glob pattern, directory with `.parquet` extension, or partitioned dataset)
- Database connection strings (URI format with optional table specification)

The table types marked with an asterisk need to be prepared as Ibis tables (with type of `ibis.expr.types.relations.Table`). Furthermore, using `assistant()` with these types of tables requires the Ibis library (`v9.5.0` or above) to be installed. If the input table is a Polars or Pandas DataFrame, the availability of Ibis is not needed.

To use a CSV file, ensure that a string or `pathlib.Path` object with a `.csv` extension is provided. The file will be automatically detected and loaded using the best available DataFrame library. The loading preference is Polars first, then Pandas as a fallback.

load_dataset(dataset: "Literal['small_table', 'game_revenue', 'nycflights', 'global_sales']" = 'small_table', tbl_type: "Literal['polars', 'pandas', 'duckdb']" = 'polars') -> 'FrameT | Any'

Load a dataset hosted in the library as a specified table type.

The Pointblank library includes several datasets that can be loaded using the `load_dataset()` function. The datasets can be loaded as a Polars DataFrame, a Pandas DataFrame, or as a DuckDB table (which uses the Ibis library backend). These datasets are used throughout the documentation's examples to demonstrate the functionality of the library. They're also useful for experimenting with the library and trying out different validation scenarios.

Parameters
----------
dataset
    The name of the dataset to load. Current options are `"small_table"`, `"game_revenue"`, `"nycflights"`, and `"global_sales"`.
tbl_type
    The type of table to generate from the dataset. The named options are `"polars"`, `"pandas"`, and `"duckdb"`.

Returns
-------
FrameT | Any
    The dataset for the `Validate` object. This could be a Polars DataFrame, a Pandas DataFrame, or a DuckDB table as an Ibis table.

Included Datasets
-----------------
There are four included datasets that can be loaded using the `load_dataset()` function:

- `"small_table"`: A small dataset with 13 rows and 8 columns. This dataset is useful for testing and demonstration purposes.
- `"game_revenue"`: A dataset with 2000 rows and 11 columns. Provides revenue data for a game development company. For the particular game, there are records of player sessions, the items they purchased, ads viewed, and the revenue generated.
- `"nycflights"`: A dataset with 336,776 rows and 18 columns. This dataset provides information about flights departing from New York City airports (JFK, LGA, or EWR) in 2013.
- `"global_sales"`: A dataset with 50,000 rows and 20 columns. Provides information about global sales of products across different regions and countries.
Supported DataFrame Types ------------------------- The `tbl_type=` parameter can be set to one of the following: - `"polars"`: A Polars DataFrame. - `"pandas"`: A Pandas DataFrame. - `"duckdb"`: An Ibis table for a DuckDB database. Examples -------- Load the `"small_table"` dataset as a Polars DataFrame by calling `load_dataset()` with `dataset="small_table"` and `tbl_type="polars"`: Note that the `"small_table"` dataset is a Polars DataFrame and using the [`preview()`](`pointblank.preview`) function will display the table in an HTML viewing environment. The `"game_revenue"` dataset can be loaded as a Pandas DataFrame by specifying the dataset name and setting `tbl_type="pandas"`: ```python game_revenue = pb.load_dataset(dataset="game_revenue", tbl_type="pandas") pb.preview(game_revenue) ``` The `"game_revenue"` dataset is a more real-world dataset with a mix of data types, and it's significantly larger than the `small_table` dataset at 2000 rows and 11 columns. The `"nycflights"` dataset can be loaded as a DuckDB table by specifying the dataset name and setting `tbl_type="duckdb"`: ```python nycflights = pb.load_dataset(dataset="nycflights", tbl_type="duckdb") pb.preview(nycflights) ``` The `"nycflights"` dataset is a large dataset with 336,776 rows and 18 columns. This dataset is truly a real-world dataset and provides information about flights originating from New York City airports in 2013. Finally, the `"global_sales"` dataset can be loaded as a Polars table by specifying the dataset name. Since `tbl_type=` is set to `"polars"` by default, we don't need to specify it: ```python global_sales = pb.load_dataset(dataset="global_sales") pb.preview(global_sales) ``` The `"global_sales"` dataset is a large dataset with 50,000 rows and 20 columns. Each record describes the sales of a particular product to a customer located in one of three global regions: North America, Europe, or Asia. get_data_path(dataset: "Literal['small_table', 'game_revenue', 'nycflights', 'global_sales']" = 'small_table', file_type: "Literal['csv', 'parquet', 'duckdb']" = 'csv') -> 'str' Get the file path to a dataset included with the Pointblank package. This function provides direct access to the file paths of datasets included with Pointblank. These paths can be used in examples and documentation to demonstrate file-based data loading without requiring the actual data files. The returned paths can be used with `Validate(data=path)` to demonstrate CSV and Parquet file loading capabilities. Parameters ---------- dataset The name of the dataset to get the path for. Current options are `"small_table"`, `"game_revenue"`, `"nycflights"`, and `"global_sales"`. file_type The file format to get the path for. Options are `"csv"`, `"parquet"`, or `"duckdb"`. Returns ------- str The file path to the requested dataset file. Included Datasets ----------------- The available datasets are the same as those in [`load_dataset()`](`pointblank.load_dataset`): - `"small_table"`: A small dataset with 13 rows and 8 columns. Ideal for testing and examples. - `"game_revenue"`: A dataset with 2000 rows and 11 columns. Revenue data for a game company. - `"nycflights"`: A dataset with 336,776 rows and 18 columns. Flight data from NYC airports. - `"global_sales"`: A dataset with 50,000 rows and 20 columns. Global sales data across regions. 
File Types ---------- Each dataset is available in multiple formats: - `"csv"`: Comma-separated values file (`.csv`) - `"parquet"`: Parquet file (`.parquet`) - `"duckdb"`: DuckDB database file (`.ddb`) Examples -------- Get the path to a CSV file and use it with `Validate`: ```python import pointblank as pb # Get path to the small_table CSV file csv_path = pb.get_data_path("small_table", "csv") print(csv_path) # Use the path directly with Validate validation = ( pb.Validate(data=csv_path) .col_exists(["a", "b", "c"]) .col_vals_gt(columns="d", value=0) .interrogate() ) validation ``` Get a Parquet file path for validation examples: ```python # Get path to the game_revenue Parquet file parquet_path = pb.get_data_path(dataset="game_revenue", file_type="parquet") # Validate the Parquet file directly validation = ( pb.Validate(data=parquet_path, label="Game Revenue Data Validation") .col_vals_not_null(columns=["player_id", "session_id"]) .col_vals_gt(columns="item_revenue", value=0) .interrogate() ) validation ``` This is particularly useful for documentation examples where you want to demonstrate file-based workflows without requiring users to have specific data files: ```python # Example showing CSV file validation sales_csv = pb.get_data_path(dataset="global_sales", file_type="csv") validation = ( pb.Validate(data=sales_csv, label="Sales Data Validation") .col_exists(["customer_id", "product_id", "amount"]) .col_vals_regex(columns="customer_id", pattern=r"CUST_[0-9]{6}") .interrogate() ) ``` See Also -------- [`load_dataset()`](`pointblank.load_dataset`) for loading datasets directly as table objects. connect_to_table(connection_string: 'str') -> 'Any' Connect to a database table using a connection string. This utility function tests whether a connection string leads to a valid table and returns the table object if successful. It provides helpful error messages when no table is specified or when backend dependencies are missing. Parameters ---------- connection_string A database connection string with a required table specification using the `::table_name` suffix. Supported formats are outlined in the *Supported Connection String Formats* section. Returns ------- Any An Ibis table object for the specified database table. Supported Connection String Formats ----------------------------------- The `connection_string` parameter must include a valid connection string with a table name specified using the `::` syntax. Here are some examples on how to format connection strings for various backends: ``` DuckDB: "duckdb:///path/to/database.ddb::table_name" SQLite: "sqlite:///path/to/database.db::table_name" PostgreSQL: "postgresql://user:password@localhost:5432/database::table_name" MySQL: "mysql://user:password@localhost:3306/database::table_name" BigQuery: "bigquery://project/dataset::table_name" Snowflake: "snowflake://user:password@account/database/schema::table_name" ``` If the connection string does not include a table name, the function will attempt to connect to the database and list available tables, providing guidance on how to specify a table. 
Examples -------- Connect to a DuckDB table: ```python import pointblank as pb # Get path to a DuckDB database file from package data duckdb_path = pb.get_data_path("game_revenue", "duckdb") # Connect to the `game_revenue` table in the DuckDB database game_revenue = pb.connect_to_table(f"duckdb:///{duckdb_path}::game_revenue") # Use with the `preview()` function pb.preview(game_revenue) ``` Here are some backend-specific connection examples: ```python # PostgreSQL pg_table = pb.connect_to_table( "postgresql://user:password@localhost:5432/warehouse::customer_data" ) # SQLite sqlite_table = pb.connect_to_table("sqlite:///local_data.db::products") # BigQuery bq_table = pb.connect_to_table("bigquery://my-project/analytics::daily_metrics") ``` This function requires the Ibis library with appropriate backend drivers: ```bash # You can install a set of common backends: pip install 'ibis-framework[duckdb,postgres,mysql,sqlite]' # ...or specific backends as needed: pip install 'ibis-framework[duckdb]' # for DuckDB pip install 'ibis-framework[postgres]' # for PostgreSQL ``` ## The YAML family The *YAML* group contains functions that allow for the use of YAML to orchestrate validation workflows. The `yaml_interrogate()` function can be used to run a validation workflow from YAML strings or files. The `validate_yaml()` function checks if the YAML configuration passes its own validity checks. The `yaml_to_python()` function converts YAML configuration to equivalent Python code. yaml_interrogate(yaml: 'Union[str, Path]', set_tbl: 'Union[FrameT, Any, None]' = None, namespaces: 'Optional[Union[Iterable[str], Mapping[str, str]]]' = None) -> 'Validate' Execute a YAML-based validation workflow. This is the main entry point for YAML-based validation workflows. It takes YAML configuration (as a string or file path) and returns a validated `Validate` object with interrogation results. The YAML configuration defines the data source, validation steps, and optional settings like thresholds and labels. This function automatically loads the data, builds the validation plan, executes all validation steps, and returns the interrogated results. Parameters ---------- yaml YAML configuration as string or file path. Can be: (1) a YAML string containing the validation configuration, or (2) a Path object or string path to a YAML file. set_tbl An optional table to override the table specified in the YAML configuration. This allows you to apply a YAML-defined validation workflow to a different table than what's specified in the configuration. If provided, this table will replace the table defined in the YAML's `tbl` field before executing the validation workflow. This can be any supported table type including DataFrame objects, Ibis table objects, CSV file paths, Parquet file paths, GitHub URLs, or database connection strings. namespaces Optional module namespaces to make available for Python code execution in YAML configurations. Can be a dictionary mapping aliases to module names or a list of module names. See the "Using Namespaces" section below for detailed examples. Returns ------- Validate An instance of the `Validate` class that has been configured based on the YAML input. This object contains the results of the validation steps defined in the YAML configuration. It includes metadata like table name, label, language, and thresholds if specified. Raises ------ YAMLValidationError If the YAML is invalid, malformed, or execution fails. 
This includes syntax errors, missing required fields, unknown validation methods, or data loading failures. Using Namespaces ---------------- The `namespaces=` parameter enables custom Python modules and functions in YAML configurations. This is particularly useful for custom action functions and advanced Python expressions. **Namespace formats:** - Dictionary format: `{"alias": "module.name"}` maps aliases to module names - List format: `["module.name", "another.module"]` imports modules directly **Option 1: Inline expressions (no namespaces needed)** ```python import pointblank as pb # Simple inline custom action yaml_config = ''' tbl: small_table thresholds: warning: 0.01 actions: warning: python: "lambda: print('Custom warning triggered')" steps: - col_vals_gt: columns: [a] value: 1000 ''' result = pb.yaml_interrogate(yaml_config) result ``` **Option 2: External functions with namespaces** ```python # Define a custom action function def my_custom_action(): print("Data validation failed: please check your data.") # Add to current module for demo import sys sys.modules[__name__].my_custom_action = my_custom_action # YAML that references the external function yaml_config = ''' tbl: small_table thresholds: warning: 0.01 actions: warning: python: actions.my_custom_action steps: - col_vals_gt: columns: [a] value: 1000 # This will fail ''' # Use namespaces to make the function available result = pb.yaml_interrogate(yaml_config, namespaces={'actions': '__main__'}) result ``` This approach enables modular, reusable validation workflows with custom business logic. Examples -------- For the examples here, we'll use YAML configurations to define validation workflows. Let's start with a basic YAML workflow that validates the built-in `small_table` dataset. ```python import pointblank as pb # Define a basic YAML validation workflow yaml_config = ''' tbl: small_table steps: - rows_distinct - col_exists: columns: [date, a, b] ''' # Execute the validation workflow result = pb.yaml_interrogate(yaml_config) result ``` The validation table shows the results of our YAML-defined workflow. We can see that the `rows_distinct()` validation failed (because there are duplicate rows in the table), while the column existence checks passed. Now let's create a more comprehensive validation workflow with thresholds and metadata: ```python # Advanced YAML configuration with thresholds and metadata yaml_config = ''' tbl: small_table tbl_name: small_table_demo label: Comprehensive data validation thresholds: warning: 0.1 error: 0.25 critical: 0.35 steps: - col_vals_gt: columns: [d] value: 100 - col_vals_regex: columns: [b] pattern: '[0-9]-[a-z]{3}-[0-9]{3}' - col_vals_not_null: columns: [date, a] ''' # Execute the validation workflow result = pb.yaml_interrogate(yaml_config) print(f"Table name: {result.tbl_name}") print(f"Label: {result.label}") print(f"Total validation steps: {len(result.validation_info)}") ``` The validation results now include our custom table name and label. The thresholds we defined will determine when validation steps are marked as warnings, errors, or critical failures. You can also load YAML configurations from files. 
Here's how you would work with a YAML file: ```python from pathlib import Path import tempfile # Create a temporary YAML file for demonstration yaml_content = ''' tbl: small_table tbl_name: File-based Validation steps: - col_vals_between: columns: [c] left: 1 right: 10 - col_vals_in_set: columns: [f] set: [low, mid, high] ''' with tempfile.NamedTemporaryFile(mode='w', suffix='.yaml', delete=False) as f: f.write(yaml_content) yaml_file_path = Path(f.name) # Load and execute validation from file result = pb.yaml_interrogate(yaml_file_path) result ``` This approach is particularly useful for storing validation configurations as part of your data pipeline or version control system, allowing you to maintain validation rules alongside your code. ### Using `set_tbl=` to Override the Table The `set_tbl=` parameter allows you to override the table specified in the YAML configuration. This is useful when you have a template validation workflow but want to apply it to different tables: ```python import polars as pl # Create a test table with similar structure to small_table test_table = pl.DataFrame({ "date": ["2023-01-01", "2023-01-02", "2023-01-03"], "a": [1, 2, 3], "b": ["1-abc-123", "2-def-456", "3-ghi-789"], "d": [150, 200, 250] }) # Use the same YAML config but apply it to our test table yaml_config = ''' tbl: small_table # This will be overridden tbl_name: Test Table # This name will be used steps: - col_exists: columns: [date, a, b, d] - col_vals_gt: columns: [d] value: 100 ''' # Execute with table override result = pb.yaml_interrogate(yaml_config, set_tbl=test_table) print(f"Validation applied to: {result.tbl_name}") result ``` This feature makes YAML configurations more reusable and flexible, allowing you to define validation logic once and apply it to multiple similar tables. validate_yaml(yaml: 'Union[str, Path]') -> 'None' Validate YAML configuration against the expected structure. This function validates that a YAML configuration conforms to the expected structure for validation workflows. It checks for required fields, proper data types, and valid validation method names. This is useful for validating configurations before execution or for building configuration editors and validators. The function performs comprehensive validation including: - required fields ('tbl' and 'steps') - proper data types for all fields - valid threshold configurations - known validation method names - proper step configuration structure Parameters ---------- yaml YAML configuration as string or file path. Can be: (1) a YAML string containing the validation configuration, or (2) a Path object or string path to a YAML file. Raises ------ YAMLValidationError If the YAML is invalid, malformed, or execution fails. This includes syntax errors, missing required fields, unknown validation methods, or data loading failures. Examples -------- For the examples here, we'll demonstrate how to validate YAML configurations before using them with validation workflows. This is particularly useful for building robust data validation systems where you want to catch configuration errors early. 
Let's start with validating a basic configuration: ```python import pointblank as pb # Define a basic YAML validation configuration yaml_config = ''' tbl: small_table steps: - rows_distinct - col_exists: columns: [a, b] ''' # Validate the configuration: no exception means it's valid pb.validate_yaml(yaml_config) print("Basic YAML configuration is valid") ``` The function completed without raising an exception, which means our configuration is valid and follows the expected structure. Now let's validate a more complex configuration with thresholds and metadata: ```python # Complex YAML configuration with all optional fields yaml_config = ''' tbl: small_table tbl_name: My Dataset label: Quality check lang: en locale: en thresholds: warning: 0.1 error: 0.25 critical: 0.35 steps: - rows_distinct - col_vals_gt: columns: [d] value: 100 - col_vals_regex: columns: [b] pattern: '[0-9]-[a-z]{3}-[0-9]{3}' ''' # Validate the configuration pb.validate_yaml(yaml_config) print("Complex YAML configuration is valid") # Count the validation steps import pointblank.yaml as pby config = pby.load_yaml_config(yaml_config) print(f"Configuration has {len(config['steps'])} validation steps") ``` This configuration includes all the optional metadata fields and complex validation steps, demonstrating that the validation handles the full range of supported options. Let's see what happens when we try to validate an invalid configuration: ```python # Invalid YAML configuration: missing required 'tbl' field invalid_yaml = ''' steps: - rows_distinct ''' try: pb.validate_yaml(invalid_yaml) except pb.yaml.YAMLValidationError as e: print(f"Validation failed: {e}") ``` The validation correctly identifies that our configuration is missing the required `'tbl'` field. Here's a practical example of using validation in a workflow builder: ```python def safe_yaml_interrogate(yaml_config): """Safely execute a YAML configuration after validation.""" try: # Validate the YAML configuration first pb.validate_yaml(yaml_config) print("✓ YAML configuration is valid") # Then execute the workflow result = pb.yaml_interrogate(yaml_config) print(f"Validation completed with {len(result.validation_info)} steps") return result except pb.yaml.YAMLValidationError as e: print(f"Configuration error: {e}") return None # Test with a valid YAML configuration test_yaml = ''' tbl: small_table steps: - col_vals_between: columns: [c] left: 1 right: 10 ''' result = safe_yaml_interrogate(test_yaml) ``` This pattern of validating before executing helps build more reliable data validation pipelines by catching configuration errors early in the process. Note that this function only validates the structure and does not check if the specified data source ('tbl') exists or is accessible. Data source validation occurs during execution with `yaml_interrogate()`. See Also -------- yaml_interrogate : execute YAML-based validation workflows yaml_to_python(yaml: 'Union[str, Path]') -> 'str' Convert YAML validation configuration to equivalent Python code. This function takes a YAML validation configuration and generates the equivalent Python code that would produce the same validation workflow. This is useful for documentation, code generation, or learning how to translate YAML workflows into programmatic workflows. The generated Python code includes all necessary imports, data loading, validation steps, and interrogation execution, formatted as executable Python code. Parameters ---------- yaml YAML configuration as string or file path. 
Can be: (1) a YAML string containing the validation configuration, or (2) a Path object or string path to a YAML file. Returns ------- str A formatted Python code string enclosed in markdown code blocks that replicates the YAML workflow. The code includes import statements, data loading, validation method calls, and interrogation execution. Raises ------ YAMLValidationError If the YAML is invalid, malformed, or contains unknown validation methods. Examples -------- Convert a basic YAML configuration to Python code: ```python import pointblank as pb # Define a YAML validation workflow yaml_config = ''' tbl: small_table tbl_name: Data Quality Check steps: - col_vals_not_null: columns: [a, b] - col_vals_gt: columns: [c] value: 0 ''' # Generate equivalent Python code python_code = pb.yaml_to_python(yaml_config) print(python_code) ``` The generated Python code shows exactly how to replicate the YAML workflow programmatically. This is particularly useful when transitioning from YAML-based workflows to code-based workflows, or when generating documentation that shows both YAML and Python approaches. For more complex workflows with thresholds and metadata: ```python # Advanced YAML configuration yaml_config = ''' tbl: small_table tbl_name: Advanced Validation label: Production data check thresholds: warning: 0.1 error: 0.2 steps: - col_vals_between: columns: [c] left: 1 right: 10 - col_vals_regex: columns: [b] pattern: '[0-9]-[a-z]{3}-[0-9]{3}' ''' # Generate the equivalent Python code python_code = pb.yaml_to_python(yaml_config) print(python_code) ``` The generated code includes all configuration parameters, thresholds, and maintains the exact same validation logic as the original YAML workflow. This function is also useful for educational purposes, helping users understand how YAML configurations map to the underlying Python API calls. ## The Utility Functions family The Utility Functions group contains functions that are useful for accessing metadata about the target data. Use `get_column_count()` or `get_row_count()` to get the number of columns or rows in a table. The `get_action_metadata()` function is useful when building custom actions since it returns metadata about the validation step that's triggering the action. Lastly, the `config()` utility lets us set global configuration parameters. get_column_count(data: 'FrameT | Any') -> 'int' Get the number of columns in a table. The `get_column_count()` function returns the number of columns in a table. The function works with any table that is supported by the `pointblank` library, including Pandas, Polars, and Ibis backend tables (e.g., DuckDB, MySQL, PostgreSQL, SQLite, Parquet, etc.). It also supports direct input of CSV files, Parquet files, and database connection strings. Parameters ---------- data The table for which to get the column count, which could be a DataFrame object, an Ibis table object, a CSV file path, a Parquet file path, or a database connection string. Read the *Supported Input Table Types* section for details on the supported table types. Returns ------- int The number of columns in the table. 
Supported Input Table Types --------------------------- The `data=` parameter can be given any of the following table types: - Polars DataFrame (`"polars"`) - Pandas DataFrame (`"pandas"`) - PySpark table (`"pyspark"`) - DuckDB table (`"duckdb"`)* - MySQL table (`"mysql"`)* - PostgreSQL table (`"postgresql"`)* - SQLite table (`"sqlite"`)* - Microsoft SQL Server table (`"mssql"`)* - Snowflake table (`"snowflake"`)* - Databricks table (`"databricks"`)* - BigQuery table (`"bigquery"`)* - Parquet table (`"parquet"`)* - CSV files (string path or `pathlib.Path` object with `.csv` extension) - Parquet files (string path, `pathlib.Path` object, glob pattern, directory with `.parquet` extension, or partitioned dataset) - Database connection strings (URI format with optional table specification) The table types marked with an asterisk need to be prepared as Ibis tables (with type of `ibis.expr.types.relations.Table`). Furthermore, using `get_column_count()` with these types of tables requires the Ibis library (`v9.5.0` or above) to be installed. If the input table is a Polars or Pandas DataFrame, the availability of Ibis is not needed. To use a CSV file, ensure that a string or `pathlib.Path` object with a `.csv` extension is provided. The file will be automatically detected and loaded using the best available DataFrame library. The loading preference is Polars first, then Pandas as a fallback. GitHub URLs pointing to CSV or Parquet files are automatically detected and converted to raw content URLs for downloading. The URL format should be: `https://github.com/user/repo/blob/branch/path/file.csv` or `https://github.com/user/repo/blob/branch/path/file.parquet` Connection strings follow database URL formats and must also specify a table using the `::table_name` suffix. Examples include: ``` "duckdb:///path/to/database.ddb::table_name" "sqlite:///path/to/database.db::table_name" "postgresql://user:password@localhost:5432/database::table_name" "mysql://user:password@localhost:3306/database::table_name" "bigquery://project/dataset::table_name" "snowflake://user:password@account/database/schema::table_name" ``` When using connection strings, the Ibis library with the appropriate backend driver is required. Examples -------- To get the number of columns in a table, we can use the `get_column_count()` function. Here's an example using the `small_table` dataset (itself loaded using the [`load_dataset()`](`pointblank.load_dataset`) function): This table is a Polars DataFrame, but the `get_column_count()` function works with any table supported by `pointblank`, including Pandas DataFrames and Ibis backend tables. 
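As a minimal sketch of that basic Polars case (the value noted below is what `small_table` yields):

```python
import pointblank as pb

# Load the `small_table` dataset (a Polars DataFrame by default)
small_table = pb.load_dataset(dataset="small_table")

# Get the number of columns (returns 8 for `small_table`)
pb.get_column_count(small_table)
```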
Here's an example using a DuckDB table handled by Ibis: ```python small_table_duckdb = pb.load_dataset("small_table", tbl_type="duckdb") pb.get_column_count(small_table_duckdb) ``` #### Working with CSV Files The `get_column_count()` function can directly accept CSV file paths: #### Working with Parquet Files The function supports various Parquet input formats: You can also use glob patterns and directories: ```python # Multiple Parquet files with glob patterns pb.get_column_count("data/sales_*.parquet") # Directory containing Parquet files pb.get_column_count("parquet_data/") # Partitioned Parquet dataset pb.get_column_count("sales_data/") # Auto-discovers partition columns ``` #### Working with Database Connection Strings The function supports database connection strings for direct access to database tables: The function always returns the number of columns in the table as an integer value, which is `8` for the `small_table` dataset. get_row_count(data: 'FrameT | Any') -> 'int' Get the number of rows in a table. The `get_row_count()` function returns the number of rows in a table. The function works with any table that is supported by the `pointblank` library, including Pandas, Polars, and Ibis backend tables (e.g., DuckDB, MySQL, PostgreSQL, SQLite, Parquet, etc.). It also supports direct input of CSV files, Parquet files, and database connection strings. Parameters ---------- data The table for which to get the row count, which could be a DataFrame object, an Ibis table object, a CSV file path, a Parquet file path, or a database connection string. Read the *Supported Input Table Types* section for details on the supported table types. Returns ------- int The number of rows in the table. Supported Input Table Types --------------------------- The `data=` parameter can be given any of the following table types: - Polars DataFrame (`"polars"`) - Pandas DataFrame (`"pandas"`) - PySpark table (`"pyspark"`) - DuckDB table (`"duckdb"`)* - MySQL table (`"mysql"`)* - PostgreSQL table (`"postgresql"`)* - SQLite table (`"sqlite"`)* - Microsoft SQL Server table (`"mssql"`)* - Snowflake table (`"snowflake"`)* - Databricks table (`"databricks"`)* - BigQuery table (`"bigquery"`)* - Parquet table (`"parquet"`)* - CSV files (string path or `pathlib.Path` object with `.csv` extension) - Parquet files (string path, `pathlib.Path` object, glob pattern, directory with `.parquet` extension, or partitioned dataset) - GitHub URLs (direct links to CSV or Parquet files on GitHub) - Database connection strings (URI format with optional table specification) The table types marked with an asterisk need to be prepared as Ibis tables (with type of `ibis.expr.types.relations.Table`). Furthermore, using `get_row_count()` with these types of tables requires the Ibis library (`v9.5.0` or above) to be installed. If the input table is a Polars or Pandas DataFrame, the availability of Ibis is not needed. To use a CSV file, ensure that a string or `pathlib.Path` object with a `.csv` extension is provided. The file will be automatically detected and loaded using the best available DataFrame library. The loading preference is Polars first, then Pandas as a fallback. GitHub URLs pointing to CSV or Parquet files are automatically detected and converted to raw content URLs for downloading. 
The URL format should be: `https://github.com/user/repo/blob/branch/path/file.csv` or `https://github.com/user/repo/blob/branch/path/file.parquet` Connection strings follow database URL formats and must also specify a table using the `::table_name` suffix. Examples include: ``` "duckdb:///path/to/database.ddb::table_name" "sqlite:///path/to/database.db::table_name" "postgresql://user:password@localhost:5432/database::table_name" "mysql://user:password@localhost:3306/database::table_name" "bigquery://project/dataset::table_name" "snowflake://user:password@account/database/schema::table_name" ``` When using connection strings, the Ibis library with the appropriate backend driver is required. Examples -------- Getting the number of rows in a table is easily done by using the `get_row_count()` function. Here's an example using the `game_revenue` dataset (itself loaded using the [`load_dataset()`](`pointblank.load_dataset`) function): This table is a Polars DataFrame, but the `get_row_count()` function works with any table supported by `pointblank`, including Pandas DataFrames and Ibis backend tables. Here's an example using a DuckDB table handled by Ibis: ```python game_revenue_duckdb = pb.load_dataset("game_revenue", tbl_type="duckdb") pb.get_row_count(game_revenue_duckdb) ``` #### Working with CSV Files The `get_row_count()` function can directly accept CSV file paths: #### Working with Parquet Files The function supports various Parquet input formats: You can also use glob patterns and directories: ```python # Multiple Parquet files with glob patterns pb.get_row_count("data/sales_*.parquet") # Directory containing Parquet files pb.get_row_count("parquet_data/") # Partitioned Parquet dataset pb.get_row_count("sales_data/") # Auto-discovers partition columns ``` #### Working with Database Connection Strings The function supports database connection strings for direct access to database tables: The function always returns the number of rows in the table as an integer value, which is `2000` for the `game_revenue` dataset. get_action_metadata() -> 'dict | None' Access step-level metadata when authoring custom actions. Get the metadata for the validation step where an action was triggered. This can be called by user functions to get the metadata for the current action. This function can only be used within callables crafted for the [`Actions`](`pointblank.Actions`) class. Returns ------- dict | None A dictionary containing the metadata for the current step. If called outside of an action (i.e., when no action is being executed), this function will return `None`. Description of the Metadata Fields ---------------------------------- The metadata dictionary contains the following fields for a given validation step: - `step`: The step number. - `column`: The column name. - `value`: The value being compared (only available in certain validation steps). - `type`: The assertion type (e.g., `"col_vals_gt"`, etc.). - `time`: The time the validation step was executed (in ISO format). - `level`: The severity level (`"warning"`, `"error"`, or `"critical"`). - `level_num`: The severity level as a numeric value (`30`, `40`, or `50`). - `autobrief`: A localized and brief statement of the expectation for the step. - `failure_text`: Localized text that explains how the validation step failed. Examples -------- When creating a custom action, you can access the metadata for the current step using the `get_action_metadata()` function. 
Here's an example of a custom action that logs the metadata for the current step:

```python
import pointblank as pb

def log_issue():
    metadata = pb.get_action_metadata()
    print(f"Type: {metadata['type']}, Step: {metadata['step']}")

validation = (
    pb.Validate(
        data=pb.load_dataset(dataset="game_revenue", tbl_type="duckdb"),
        thresholds=pb.Thresholds(warning=0.05, error=0.10, critical=0.15),
        actions=pb.Actions(warning=log_issue),
    )
    .col_vals_regex(columns="player_id", pattern=r"[A-Z]{12}[0-9]{3}")
    .col_vals_gt(columns="item_revenue", value=0.05)
    .col_vals_gt(columns="session_duration", value=15)
    .interrogate()
)

validation
```

Key pieces to note in the above example:

- `log_issue()` (the custom action) collects `metadata` by calling `get_action_metadata()`
- the `metadata` is a dictionary that is used to craft the log message
- the action is passed as a bare function to the `Actions` object within the `Validate` object (placing it within `Validate(actions=)` ensures it's set as an action for every validation step)

See Also
--------

Have a look at [`Actions`](`pointblank.Actions`) for more information on how to create custom actions for validation steps that exceed a set threshold value.

get_validation_summary() -> 'dict | None'

Access validation summary information when authoring final actions.

This function provides a convenient way to access summary information about the validation process within a final action. It returns a dictionary with key metrics from the validation process. This function can only be used within callables crafted for the [`FinalActions`](`pointblank.FinalActions`) class.

Returns
-------
dict | None
    A dictionary containing validation metrics. If called outside of a final action context, this function will return `None`.

Description of the Summary Fields
---------------------------------

The summary dictionary contains the following fields:

- `n_steps` (`int`): The total number of validation steps.
- `n_passing_steps` (`int`): The number of validation steps where all test units passed.
- `n_failing_steps` (`int`): The number of validation steps that had some failing test units.
- `n_warning_steps` (`int`): The number of steps that exceeded a 'warning' threshold.
- `n_error_steps` (`int`): The number of steps that exceeded an 'error' threshold.
- `n_critical_steps` (`int`): The number of steps that exceeded a 'critical' threshold.
- `list_passing_steps` (`list[int]`): List of step numbers where all test units passed.
- `list_failing_steps` (`list[int]`): List of step numbers for steps having failing test units.
- `dict_n` (`dict`): The number of test units for each validation step.
- `dict_n_passed` (`dict`): The number of test units that passed for each validation step.
- `dict_n_failed` (`dict`): The number of test units that failed for each validation step.
- `dict_f_passed` (`dict`): The fraction of test units that passed for each validation step.
- `dict_f_failed` (`dict`): The fraction of test units that failed for each validation step.
- `dict_warning` (`dict`): The 'warning' level status for each validation step.
- `dict_error` (`dict`): The 'error' level status for each validation step.
- `dict_critical` (`dict`): The 'critical' level status for each validation step.
- `all_passed` (`bool`): Whether or not every validation step had no failing test units.
- `highest_severity` (`str`): The highest severity level encountered during validation. This can be one of the following: `"warning"`, `"error"`, `"critical"`, `"some failing"`, or `"all passed"`.
- `tbl_row_count` (`int`): The number of rows in the target table. - `tbl_column_count` (`int`): The number of columns in the target table. - `tbl_name` (`str`): The name of the target table. - `validation_duration` (`float`): The duration of the validation in seconds. Note that the summary dictionary is only available within the context of a final action. If called outside of a final action (i.e., when no final action is being executed), this function will return `None`. Examples -------- Final actions are executed after the completion of all validation steps. They provide an opportunity to take appropriate actions based on the overall validation results. Here's an example of a final action function (`send_report()`) that sends an alert when critical validation failures are detected: ```python import pointblank as pb def send_report(): summary = pb.get_validation_summary() if summary["highest_severity"] == "critical": # Send an alert email send_alert_email( subject=f"CRITICAL validation failures in {summary['tbl_name']}", body=f"{summary['n_critical_steps']} steps failed with critical severity." ) validation = ( pb.Validate( data=my_data, final_actions=pb.FinalActions(send_report) ) .col_vals_gt(columns="revenue", value=0) .interrogate() ) ``` Note that `send_alert_email()` in the example above is a placeholder function that would be implemented by the user to send email alerts. This function is not provided by the Pointblank package. The `get_validation_summary()` function can also be used to create custom reporting for validation results: ```python def log_validation_results(): summary = pb.get_validation_summary() print(f"Validation completed with status: {summary['highest_severity'].upper()}") print(f"Steps: {summary['n_steps']} total") print(f" - {summary['n_passing_steps']} passing, {summary['n_failing_steps']} failing") print( f" - Severity: {summary['n_warning_steps']} warnings, " f"{summary['n_error_steps']} errors, " f"{summary['n_critical_steps']} critical" ) if summary['highest_severity'] in ["error", "critical"]: print("⚠️ Action required: Please review failing validation steps!") ``` Final actions work well with both simple logging and more complex notification systems, allowing you to integrate validation results into your broader data quality workflows. See Also -------- Have a look at [`FinalActions`](`pointblank.FinalActions`) for more information on how to create custom actions that are executed after all validation steps have been completed. write_file(validation: 'Validate', filename: 'str', path: 'str | None' = None, keep_tbl: 'bool' = False, keep_extracts: 'bool' = False, quiet: 'bool' = False) -> 'None' Write a Validate object to disk as a serialized file. Writing a validation object to disk with `write_file()` can be useful for keeping data validation results close at hand for later retrieval (with `read_file()`). By default, any data table that the validation object holds will be removed before writing to disk (not applicable if no data table is present). This behavior can be changed by setting `keep_tbl=True`, but this only works when the table is not of a database type (e.g., DuckDB, PostgreSQL, etc.), as database connections cannot be serialized. Extract data from failing validation steps can also be preserved by setting `keep_extracts=True`, which is useful for later analysis of data quality issues. 
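As a quick sketch of those two options (fuller examples appear later in this section):

```python
import pointblank as pb

validation = (
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_gt(columns="d", value=100)
    .interrogate()
)

# Persist the results along with the table and any failing-row extracts
pb.write_file(validation, "my_validation", keep_tbl=True, keep_extracts=True)
```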
The serialized file uses Python's pickle format for storage of the validation object state, including all validation results, metadata, and optionally the source data. **Important note.** If your validation uses custom preprocessing functions (via the `pre=` parameter), these functions must be defined at the module level (not interactively or as lambda functions) to ensure they can be properly restored when loading the validation in a different Python session. Read the *Creating Serializable Validations* section below for more information. :::{.callout-warning} The `write_file()` function is currently experimental. Please report any issues you encounter in the [Pointblank issue tracker](https://github.com/posit-dev/pointblank/issues). ::: Parameters ---------- validation The `Validate` object to write to disk. filename The filename to create on disk for the validation object. Should not include the file extension as `.pkl` will be added automatically. path An optional directory path where the file should be saved. If not provided, the file will be saved in the current working directory. The directory will be created if it doesn't exist. keep_tbl An option to keep the data table that is associated with the validation object. The default is `False` where the data table is removed before writing to disk. For database tables (e.g., Ibis tables with database backends), the table is always removed even if `keep_tbl=True`, as database connections cannot be serialized. keep_extracts An option to keep any collected extract data for failing rows from validation steps. By default, this is `False` (i.e., extract data is removed to save space). quiet Should the function not inform when the file is written? By default, this is `False`, so a message will be printed when the file is successfully written. Returns ------- None This function doesn't return anything but saves the validation object to disk. Creating Serializable Validations --------------------------------- To ensure your validations work reliably across different Python sessions, the recommended approach is to use module-Level functions. So, create a separate Python file for your preprocessing functions: ```python # preprocessing_functions.py import polars as pl def multiply_by_100(df): return df.with_columns(pl.col("value") * 100) def add_computed_column(df): return df.with_columns(computed=pl.col("value") * 2 + 10) ``` Then import and use them in your validation: ```python # your_main_script.py import pointblank as pb from preprocessing_functions import multiply_by_100, add_computed_column validation = ( pb.Validate(data=my_data) .col_vals_gt(columns="value", value=500, pre=multiply_by_100) .col_vals_between(columns="computed", left=50, right=1000, pre=add_computed_column) .interrogate() ) # Save validation and it will work reliably across sessions pb.write_file(validation, "my_validation", keep_tbl=True) ``` ### Problematic Patterns to Avoid Don't use lambda functions as they will cause immediate errors. Don't use interactive function definitions (as they may fail when loading). 
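For instance, here is a sketch of the lambda anti-pattern (it uses the same placeholder `data` table and `value` column as the snippet that follows); `write_file()` will raise a clear error for it:

```python
import polars as pl
import pointblank as pb

# Don't do this: a lambda passed to `pre=` cannot be restored in another session
validation = pb.Validate(data).col_vals_gt(
    columns="value", value=100, pre=lambda df: df.with_columns(pl.col("value") * 2)
)
```

An interactively defined function is just as fragile: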
```python def my_function(df): # Defined in notebook/REPL return df.with_columns(pl.col("value") * 2) validation = pb.Validate(data).col_vals_gt( columns="value", value=100, pre=my_function ) ``` ### Automatic Analysis and Guidance When you call `write_file()`, it automatically analyzes your validation and provides: - confirmation when all functions will work reliably - warnings for functions that may cause cross-session issues - clear errors for unsupported patterns (lambda functions) - specific recommendations and code examples - loading instructions tailored to your validation ### Loading Your Validation To load a saved validation in a new Python session: ```python # In a new Python session import pointblank as pb # Import the same preprocessing functions used when creating the validation from preprocessing_functions import multiply_by_100, add_computed_column # Upon loading the validation, functions will be automatically restored validation = pb.read_file("my_validation.pkl") ``` ** Testing Your Validation:** To verify your validation works across sessions: 1. save your validation in one Python session 2. start a fresh Python session (restart kernel/interpreter) 3. import required preprocessing functions 4. load the validation using `read_file()` 5. test that preprocessing functions work as expected ### Performance and Storage - use `keep_tbl=False` (default) to reduce file size when you don't need the original data - use `keep_extracts=False` (default) to save space by excluding extract data - set `quiet=True` to suppress guidance messages in automated scripts - files are saved using pickle's highest protocol for optimal performance Examples -------- Let's create a simple validation and save it to disk: ```python import pointblank as pb # Create a validation validation = ( pb.Validate(data=pb.load_dataset("small_table"), label="My validation") .col_vals_gt(columns="d", value=100) .col_vals_regex(columns="b", pattern=r"[0-9]-[a-z]{3}-[0-9]{3}") .interrogate() ) # Save to disk (without the original table data) pb.write_file(validation, "my_validation") ``` To keep the original table data for later analysis: ```python # Save with the original table data included pb.write_file(validation, "my_validation_with_data", keep_tbl=True) ``` You can also specify a custom directory and keep extract data: ```python pb.write_file( validation, filename="detailed_validation", path="/path/to/validations", keep_tbl=True, keep_extracts=True ) ``` ### Working with Preprocessing Functions For validations that use preprocessing functions to be portable across sessions, define your functions in a separate `.py` file: ```python # In `preprocessing_functions.py` import polars as pl def multiply_by_100(df): return df.with_columns(pl.col("value") * 100) def add_computed_column(df): return df.with_columns(computed=pl.col("value") * 2 + 10) ``` Then import and use them in your validation: ```python # In your main script import pointblank as pb from preprocessing_functions import multiply_by_100, add_computed_column validation = ( pb.Validate(data=my_data) .col_vals_gt(columns="value", value=500, pre=multiply_by_100) .col_vals_between(columns="computed", left=50, right=1000, pre=add_computed_column) .interrogate() ) # This validation can now be saved and loaded reliably pb.write_file(validation, "my_validation", keep_tbl=True) ``` When you load this validation in a new session, simply import the preprocessing functions again and they will be automatically restored. 
See Also -------- Use the [`read_file()`](`pointblank.read_file`) function to load a validation object that was previously saved with `write_file()`. read_file(filepath: 'str | Path') -> 'Validate' Read a Validate object from disk that was previously saved with `write_file()`. This function loads a validation object that was previously serialized to disk using the `write_file()` function. The validation object will be restored with all its validation results, metadata, and optionally the source data (if it was saved with `keep_tbl=True`). :::{.callout-warning} The `read_file()` function is currently experimental. Please report any issues you encounter in the [Pointblank issue tracker](https://github.com/posit-dev/pointblank/issues). ::: Parameters ---------- filepath The path to the saved validation file. Can be a string or Path object. Returns ------- Validate The restored validation object with all its original state, validation results, and metadata. Examples -------- Load a validation object that was previously saved: ```python import pointblank as pb # Load a validation object from disk validation = pb.read_file("my_validation.pkl") # View the validation results validation ``` You can also load using just the filename (without extension): ```python # This will automatically look for "my_validation.pkl" validation = pb.read_file("my_validation") ``` The loaded validation object retains all its functionality: ```python # Get validation summary summary = validation.get_json_report() # Get sundered data (if original table was saved) if validation.data is not None: failing_rows = validation.get_sundered_data(type="fail") ``` See Also -------- Use the [`write_file()`](`pointblank.Validate.write_file`) method to save a validation object to disk for later retrieval with this function. config(report_incl_header: 'bool' = True, report_incl_footer: 'bool' = True, preview_incl_header: 'bool' = True) -> 'PointblankConfig' Configuration settings for the Pointblank library. Parameters ---------- report_incl_header This controls whether the header should be present in the validation table report. The header contains the table name, label information, and might contain global failure threshold levels (if set). report_incl_footer Should the footer of the validation table report be displayed? The footer contains the starting and ending times of the interrogation. preview_incl_header Whether the header should be present in any preview table (generated via the [`preview()`](`pointblank.preview`) function). Returns ------- PointblankConfig A `PointblankConfig` object with the specified configuration settings. ## The Prebuilt Actions family The Prebuilt Actions group contains a function that can be used to send a Slack notification when validation steps exceed failure threshold levels or just to provide a summary of the validation results, including the status, number of steps, passing and failing steps, table information, and timing details. send_slack_notification(webhook_url: 'str | None' = None, step_msg: 'str | None' = None, summary_msg: 'str | None' = None, debug: 'bool' = False) -> 'Callable' Create a Slack notification function using a webhook URL. This function can be used in two ways: 1. With [`Actions`](`pointblank.Actions`) to notify about individual validation step failures 2. With [`FinalActions`](`pointblank.FinalActions`) to provide a summary notification after all validation steps have undergone interrogation The function creates a callable that sends notifications through a Slack webhook. 
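In outline, the same callable can serve both roles (complete examples appear in the *Examples* section below):

```python
import pointblank as pb

# Create the notification callable once...
notify_slack = pb.send_slack_notification(
    webhook_url="https://hooks.slack.com/services/your/webhook/url"
)

# ...then attach it for per-step alerts, a final summary, or both
actions = pb.Actions(critical=notify_slack)
final_actions = pb.FinalActions(notify_slack)
```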
Message formatting can be customized using templates for both individual steps and summary reports. Parameters ---------- webhook_url The Slack webhook URL. If `None` (and `debug=True`), a dry run is performed (see the *Offline Testing* section below for information on this). step_msg Template string for step notifications. Some of the available variables include: `"{step}"`, `"{column}"`, `"{value}"`, `"{type}"`, `"{time}"`, `"{level}"`, etc. See the *Available Template Variables for Step Notifications* section below for more details. If not provided, a default step message template will be used. summary_msg Template string for summary notifications. Some of the available variables are: `"{n_steps}"`, `"{n_passing_steps}"`, `"{n_failing_steps}"`, `"{all_passed}"`, `"{highest_severity}"`, etc. See the *Available Template Variables for Summary Notifications* section below for more details. If not provided, a default summary message template will be used. debug Print debug information if `True`. This includes the message content and the response from Slack. This is useful for testing and debugging the notification function. If `webhook_url` is `None`, the function will print the message to the console instead of sending it to Slack. This is useful for debugging and ensuring that your templates are formatted correctly. Returns ------- Callable A function that sends notifications to Slack. Available Template Variables for Step Notifications --------------------------------------------------- When creating a custom template for validation step alerts (`step_msg=`), the following templating strings can be used: - `"{step}"`: The step number. - `"{column}"`: The column name. - `"{value}"`: The value being compared (only available in certain validation steps). - `"{type}"`: The assertion type (e.g., `"col_vals_gt"`, etc.). - `"{level}"`: The severity level (`"warning"`, `"error"`, or `"critical"`). - `"{level_num}"`: The severity level as a numeric value (`30`, `40`, or `50`). - `"{autobrief}"`: A localized and brief statement of the expectation for the step. - `"{failure_text}"`: Localized text that explains how the validation step failed. - `"{time}"`: The time of the notification. Here's an example of how to construct a `step_msg=` template: ```python step_msg = '''🚨 *Validation Step Alert* • Step Number: {step} • Column: {column} • Test Type: {type} • Value Tested: {value} • Severity: {level} (level {level_num}) • Brief: {autobrief} • Details: {failure_text} • Time: {time}''' ``` This template will be filled with the relevant information when a validation step fails. The placeholders will be replaced with actual values when the Slack notification is sent. Available Template Variables for Summary Notifications ------------------------------------------------------ When creating a custom template for a validation summary (`summary_msg=`), the following templating strings can be used: - `"{n_steps}"`: The total number of validation steps. - `"{n_passing_steps}"`: The number of validation steps where all test units passed. - `"{n_failing_steps}"`: The number of validation steps that had some failing test units. - `"{n_warning_steps}"`: The number of steps that exceeded a 'warning' threshold. - `"{n_error_steps}"`: The number of steps that exceeded an 'error' threshold. - `"{n_critical_steps}"`: The number of steps that exceeded a 'critical' threshold. - `"{all_passed}"`: Whether or not every validation step had no failing test units. 
- `"{highest_severity}"`: The highest severity level encountered during validation. This can be one of the following: `"warning"`, `"error"`, or `"critical"`, `"some failing"`, or `"all passed"`. - `"{tbl_row_count}"`: The number of rows in the target table. - `"{tbl_column_count}"`: The number of columns in the target table. - `"{tbl_name}"`: The name of the target table. - `"{validation_duration}"`: The duration of the validation in seconds. - `"{time}"`: The time of the notification. Here's an example of how to put together a `summary_msg=` template: ```python summary_msg = '''📊 *Validation Summary Report* *Overview* • Status: {highest_severity} • All Passed: {all_passed} • Total Steps: {n_steps} *Step Results* • Passing Steps: {n_passing_steps} • Failing Steps: {n_failing_steps} • Warning Level: {n_warning_steps} • Error Level: {n_error_steps} • Critical Level: {n_critical_steps} *Table Info* • Table Name: {tbl_name} • Row Count: {tbl_row_count} • Column Count: {tbl_column_count} *Timing* • Duration: {validation_duration}s • Completed: {time}''' ``` This template will be filled with the relevant information when the validation summary is generated. The placeholders will be replaced with actual values when the Slack notification is sent. Offline Testing --------------- If you want to test the function without sending actual notifications, you can leave the `webhook_url=` as `None` and set `debug=True`. This will print the message to the console instead of sending it to Slack. This is useful for debugging and ensuring that your templates are formatted correctly. Furthermore, the function could be run globally (i.e., outside of the context of a validation plan) to show the message templates with all possible variables. Here's an example of how to do this: ```python import pointblank as pb # Create a Slack notification function notify_slack = pb.send_slack_notification( webhook_url=None, # Leave as None for dry run debug=True, # Enable debug mode to print message previews ) # Call the function to see the message previews notify_slack() ``` This will print the step and summary message previews to the console, allowing you to see how the templates will look when filled with actual data. You can then adjust your templates as needed before using them in a real validation plan. When `step_msg=` and `summary_msg=` are not provided, the function will use default templates. However, you can customize the templates to include additional information or change the format to better suit your needs. Iterating on the templates can help you create more informative and visually appealing messages. Here's an example of that: ```python import pointblank as pb # Create a Slack notification function with custom templates notify_slack = pb.send_slack_notification( webhook_url=None, # Leave as None for dry run step_msg='''*Data Validation Alert* • Type: {type} • Level: {level} • Step: {step} • Column: {column} • Time: {time}''', summary_msg='''*Data Validation Summary* • Highest Severity: {highest_severity} • Total Steps: {n_steps} • Failed Steps: {n_failing_steps} • Time: {time}''', debug=True, # Enable debug mode to print message previews ) ``` These templates will be used with sample data when the function is called. The combination of `webhook_url=None` and `debug=True` allows you to test your custom templates without having to send actual notifications to Slack. 
Examples
--------

When using an action with one or more validation steps, you typically provide callables that fire when a set threshold of failed test units is met or exceeded. The callable can be a function or a lambda. The `send_slack_notification()` function creates a callable that sends a Slack notification when a validation step fails. Here is how it can be set up to work for multiple validation steps by using [`Actions`](`pointblank.Actions`):

```python
import pointblank as pb

# Create a Slack notification function
notify_slack = pb.send_slack_notification(
    webhook_url="https://hooks.slack.com/services/your/webhook/url"
)

# Create a validation plan
validation = (
    pb.Validate(
        data=pb.load_dataset(dataset="game_revenue", tbl_type="duckdb"),
        thresholds=pb.Thresholds(warning=0.05, error=0.10, critical=0.15),
        actions=pb.Actions(critical=notify_slack),
    )
    .col_vals_regex(columns="player_id", pattern=r"[A-Z]{12}[0-9]{3}")
    .col_vals_gt(columns="item_revenue", value=0.05)
    .col_vals_gt(columns="session_duration", value=15)
    .interrogate()
)

validation
```

By placing the `notify_slack` callable in the `Validate(actions=Actions(critical=))` argument, you can ensure that the notification is sent whenever the 'critical' threshold is reached (as set here, when 15% or more of the test units fail). The notification will include information about the validation step that triggered the alert.

When using a [`FinalActions`](`pointblank.FinalActions`) object, the notification will be sent after all validation steps have been completed. This is useful for providing a summary of the validation process. Here is an example of how to set up a summary notification:

```python
import pointblank as pb

# Create a Slack notification function
notify_slack = pb.send_slack_notification(
    webhook_url="https://hooks.slack.com/services/your/webhook/url"
)

# Create a validation plan
validation = (
    pb.Validate(
        data=pb.load_dataset(dataset="game_revenue", tbl_type="duckdb"),
        thresholds=pb.Thresholds(warning=0.05, error=0.10, critical=0.15),
        final_actions=pb.FinalActions(notify_slack),
    )
    .col_vals_regex(columns="player_id", pattern=r"[A-Z]{12}[0-9]{3}")
    .col_vals_gt(columns="item_revenue", value=0.05)
    .col_vals_gt(columns="session_duration", value=15)
    .interrogate()
)
```

In this case, the same `notify_slack` callable is used, but it is placed in `Validate(final_actions=FinalActions())`. This results in the summary notification being sent after all validation steps are completed, regardless of whether any steps failed. This is possible because `send_slack_notification()` creates a callable that can be used in both contexts: it automatically determines whether to send a step notification or a summary notification based on the context in which it is called.

We can customize the message templates for both step and summary notifications to make them more informative and visually appealing, for example by using Markdown formatting to improve readability.
Here is an example of how to customize the templates: ```python import pointblank as pb # Create a Slack notification function notify_slack = pb.send_slack_notification( webhook_url="https://hooks.slack.com/services/your/webhook/url", step_msg=''' 🚨 *Validation Step Alert* • Step Number: {step} • Column: {column} • Test Type: {type} • Value Tested: {value} • Severity: {level} (level {level_num}) • Brief: {autobrief} • Details: {failure_text} • Time: {time}''', summary_msg=''' 📊 *Validation Summary Report* *Overview* • Status: {highest_severity} • All Passed: {all_passed} • Total Steps: {n_steps} *Step Results* • Passing Steps: {n_passing_steps} • Failing Steps: {n_failing_steps} • Warning Level: {n_warning_steps} • Error Level: {n_error_steps} • Critical Level: {n_critical_steps} *Table Info* • Table Name: {tbl_name} • Row Count: {tbl_row_count} • Column Count: {tbl_column_count} *Timing* • Duration: {validation_duration}s • Completed: {time}''', ) # Create a validation plan validation = ( pb.Validate( data=pb.load_dataset(dataset="game_revenue", tbl_type="duckdb"), thresholds=pb.Thresholds(warning=0.05, error=0.10, critical=0.15), actions=pb.Actions(default=notify_slack), final_actions=pb.FinalActions(notify_slack), ) .col_vals_regex(columns="player_id", pattern=r"[A-Z]{12}[0-9]{3}") .col_vals_gt(columns="item_revenue", value=0.05) .col_vals_gt(columns="session_duration", value=15) .interrogate() ) ``` In this example, we have customized the templates for both step and summary notifications. The step notification includes details about the validation step, including the step number, column name, test type, value tested, severity level, brief description, and time of the notification. The summary notification includes an overview of the validation process, including the status, number of steps, passing and failing steps, table information, and timing details. ---------------------------------------------------------------------- This is a set of examples for the Pointblank library. 
---------------------------------------------------------------------- ### Starter Validation (https://posit-dev.github.io/pointblank/demos/01-starter/) A validation with the basics ```python import pointblank as pb validation = ( pb.Validate( # Use pb.Validate to start data=pb.load_dataset(dataset="small_table", tbl_type="polars"), tbl_name="small_table", label="A starter validation" ) .col_vals_gt(columns="d", value=1000) # STEP 1 | .col_vals_le(columns="c", value=5) # STEP 2 | <-- Build up a validation plan .col_exists(columns=["date", "date_time"]) # STEP 3 | .interrogate() # This will execute all validation steps and collect intel ) validation ``` ### Advanced Validation (https://posit-dev.github.io/pointblank/demos/02-advanced/) A validation with a comprehensive set of rules ```python import pointblank as pb import polars as pl validation = ( pb.Validate( data=pb.load_dataset(dataset="game_revenue", tbl_type="polars"), tbl_name="game_revenue", label="Comprehensive validation example", thresholds=pb.Thresholds(warning=0.10, error=0.25, critical=0.35), ) .col_vals_regex(columns="player_id", pattern=r"^[A-Z]{12}[0-9]{3}$") # STEP 1 .col_vals_gt(columns="session_duration", value=5) # STEP 2 .col_vals_ge(columns="item_revenue", value=0.02) # STEP 3 .col_vals_in_set(columns="item_type", set=["iap", "ad"]) # STEP 4 .col_vals_in_set( # STEP 5 columns="acquisition", set=["google", "facebook", "organic", "crosspromo", "other_campaign"] ) .col_vals_not_in_set(columns="country", set=["Mongolia", "Germany"]) # STEP 6 .col_vals_between( # STEP 7 columns="session_duration", left=10, right=50, pre = lambda df: df.select(pl.median("session_duration")) ) .rows_distinct(columns_subset=["player_id", "session_id", "time"]) # STEP 8 .row_count_match(count=2000) # STEP 9 .col_count_match(count=11) # STEP 10 .col_vals_not_null(columns=pb.starts_with("item")) # STEPS 11-13 .col_exists(columns="start_day") # STEP 14 .interrogate() ) validation ``` ### Data Extracts (https://posit-dev.github.io/pointblank/demos/03-data-extracts/) Pulling out data extracts that highlight rows with validation failures ```python import pointblank as pb validation = ( pb.Validate( data=pb.load_dataset(dataset="game_revenue"), tbl_name="game_revenue", label="Validation with test unit failures available as an extract" ) .col_vals_gt(columns="item_revenue", value=0) # STEP 1: no test unit failures .col_vals_ge(columns="session_duration", value=5) # STEP 2: 14 test unit failures -> extract .interrogate() ) ``` ```python pb.preview(validation.get_data_extracts(i=2, frame=True), n_head=20, n_tail=20) ``` ### Sundered Data (https://posit-dev.github.io/pointblank/demos/04-sundered-data/) Splitting your data into 'pass' and 'fail' subsets ```python import pointblank as pb import polars as pl validation = ( pb.Validate( data=pb.load_dataset(dataset="small_table", tbl_type="pandas"), tbl_name="small_table", label="Sundering Data" ) .col_vals_gt(columns="d", value=1000) .col_vals_le(columns="c", value=5) .interrogate() ) validation ``` ```python pb.preview(validation.get_sundered_data(type="pass")) ``` ### Step Report: Column Data Checks (https://posit-dev.github.io/pointblank/demos/05-step-report-column-check/) A step report for column checks shows what went wrong ```python import pointblank as pb validation = ( pb.Validate( data=pb.load_dataset(dataset="small_table"), tbl_name="small_table", label="Step reports for column data checks" ) .col_vals_ge(columns="c", value=4, na_pass=True) # has failing test units .col_vals_regex(columns="b", 
pattern=r"\d-[a-z]{3}-\d{3}") # no failing test units .interrogate() ) validation ``` ```python validation.get_step_report(i=1) ``` ```python validation.get_step_report(i=2) ``` ### Step Report: Schema Check (https://posit-dev.github.io/pointblank/demos/06-step-report-schema-check/) When a schema doesn't match, a step report gives you the details ```python import pointblank as pb # Create a schema for the target table (`small_table` as a DuckDB table) schema = pb.Schema( columns=[ ("date_time", "timestamp"), # this dtype doesn't match ("dates", "date"), # this column name doesn't match ("a", "int64"), ("b",), # omit dtype to not check for it ("c",), # "" "" "" "" ("d", "float64"), ("e", ["bool", "boolean"]), # try several dtypes (second one matches) ("f", "str"), # this dtype doesn't match ] ) # Use the `col_schema_match()` validation method to perform a schema check validation = ( pb.Validate( data=pb.load_dataset(dataset="small_table", tbl_type="duckdb"), tbl_name="small_table", label="Step report for a schema check" ) .col_schema_match(schema=schema) .interrogate() ) validation ``` ```python validation.get_step_report(i=1) ``` ### Apply Validation Rules to Multiple Columns (https://posit-dev.github.io/pointblank/demos/apply-checks-to-several-columns/) Create multiple validation steps by using a list of column names with `columns=` ```python import pointblank as pb validation = ( pb.Validate( data=pb.load_dataset(dataset="small_table", tbl_type="polars") ) .col_vals_ge(columns=["a", "c", "d"], value=0) # check values in 'a', 'c', and 'd' .col_exists(columns=["date_time", "date"]) # check for the existence of two columns .interrogate() ) validation ``` ### Verifying Row and Column Counts (https://posit-dev.github.io/pointblank/demos/check-row-column-counts/) Check the dimensions of the table with the `*_count_match()` validation methods ```python import pointblank as pb validation = ( pb.Validate( data=pb.load_dataset(dataset="game_revenue", tbl_type="duckdb") ) .col_count_match(count=11) # expect 11 columns in the table .row_count_match(count=2000) # expect 2,000 rows in the table .row_count_match(count=0, inverse=True) # expect that the table has rows .col_count_match( # compare column count against count=pb.load_dataset( # that of another table dataset="game_revenue", tbl_type="pandas" ) ) .interrogate() ) validation ``` ### Checks for Missing Values (https://posit-dev.github.io/pointblank/demos/checks-for-missing/) Perform validations that check whether missing/NA/Null values are present ```python import pointblank as pb validation = ( pb.Validate( data=pb.load_dataset(dataset="small_table", tbl_type="polars") ) .col_vals_not_null(columns="a") # expect no Null values .col_vals_not_null(columns="b") # "" "" .col_vals_not_null(columns="c") # "" "" .col_vals_not_null(columns="d") # "" "" .col_vals_null(columns="a") # expect all values to be Null .interrogate() ) validation ``` ### Custom Expression for Checking Column Values (https://posit-dev.github.io/pointblank/demos/col-vals-custom-expr/) A column expression can be used to check column values. 
Just use `col_vals_expr()` for this

```python
import pointblank as pb

validation = (
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="pandas")
    )
    .col_vals_expr(expr=lambda df: (df["d"] % 1 != 0) & (df["a"] < 10))  # Pandas column expr
    .interrogate()
)

validation
```

### Column Selector Functions: Easily Pick Columns (https://posit-dev.github.io/pointblank/demos/column-selector-functions/)

Use column selector functions in the `columns=` argument to conveniently choose columns

```python
import pointblank as pb
import narwhals.selectors as ncs

validation = (
    pb.Validate(
        data=pb.load_dataset(dataset="game_revenue", tbl_type="polars")
    )
    .col_vals_ge(
        columns=pb.matches("rev|dur"),  # check values in columns having 'rev' or 'dur' in name
        value=0
    )
    .col_vals_regex(
        columns=pb.ends_with("_id"),    # check values in columns with names ending in '_id'
        pattern=r"^[A-Z]{12}\d{3}"
    )
    .col_vals_not_null(
        columns=pb.last_n(2)            # check that the last two columns don't have Null values
    )
    .col_vals_regex(
        columns=ncs.string(),           # check that all string columns are non-empty strings
        pattern=r"(.|\s)*\S(.|\s)*"
    )
    .interrogate()
)

validation
```

### Comparison Checks Across Columns (https://posit-dev.github.io/pointblank/demos/comparisons-across-columns/)

Perform comparisons of values in columns to values in other columns

```python
import pointblank as pb

validation = (
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars")
    )
    .col_vals_lt(columns="a", value=pb.col("c"))  # values in 'a' < values in 'c'
    .col_vals_between(
        columns="d",        # values in 'd' are between values
        left=pb.col("c"),   # in 'c' and the fixed value of 12,000;
        right=12000,        # any missing values encountered result
        na_pass=True        # in a passing test unit
    )
    .interrogate()
)

validation
```

### Expect No Duplicate Rows (https://posit-dev.github.io/pointblank/demos/expect-no-duplicate-rows/)

We can check for duplicate rows in the table with `rows_distinct()`

```python
import pointblank as pb

validation = (
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars")
    )
    .rows_distinct()  # expect no duplicate rows
    .interrogate()
)

validation
```

### Checking for Duplicate Values (https://posit-dev.github.io/pointblank/demos/expect-no-duplicate-values/)

To check for duplicate values down a column, use `rows_distinct()` with a `columns_subset=` value

```python
import pointblank as pb

validation = (
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars")
    )
    .rows_distinct(columns_subset="b")  # expect no duplicate values in 'b'
    .interrogate()
)

validation
```

### Expectations with a Text Pattern (https://posit-dev.github.io/pointblank/demos/expect-text-pattern/)

With `col_vals_regex()`, check for conformance to a regular expression

```python
import pointblank as pb

validation = (
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars")
    )
    .col_vals_regex(columns="b", pattern=r"^\d-[a-z]{3}-\d{3}$")  # check pattern in 'b'
    .col_vals_regex(columns="f", pattern=r"high|low|mid")         # check pattern in 'f'
    .interrogate()
)

validation
```

### Set Failure Threshold Levels (https://posit-dev.github.io/pointblank/demos/failure-thresholds/)

Set threshold levels to better gauge adverse data quality

```python
import pointblank as pb

validation = (
    pb.Validate(
        data=pb.load_dataset(dataset="game_revenue", tbl_type="duckdb"),
        thresholds=pb.Thresholds(  # setting relative threshold defaults for all steps
            warning=0.05,          # 5% failing test units: warning threshold (gray)
            error=0.10,            # 10% failed test
units: error threshold (yellow) critical=0.15 # 15% failed test units: critical threshold (red) ), ) .col_vals_in_set(columns="item_type", set=["iap", "ad"]) .col_vals_regex(columns="player_id", pattern=r"[A-Z]{12}\d{3}") .col_vals_gt(columns="item_revenue", value=0.05) .col_vals_gt( columns="session_duration", value=4, thresholds=(5, 10, 20) # setting absolute thresholds for *this* step (W, E, C) ) .col_exists(columns="end_day") .interrogate() ) validation ``` ### Mutate the Table in a Validation Step (https://posit-dev.github.io/pointblank/demos/mutate-table-in-step/) For far more specialized validations, modify the table with the `pre=` argument before checking it ```python import pointblank as pb import polars as pl import narwhals as nw # Define preprocessing functions def get_median_a(df): """Use a Polars expression to aggregate column `a`.""" return df.select(pl.median("a")) def add_b_length_column(df): """Use Narwhals to add a string length column `b_len`.""" return ( nw.from_native(df) .with_columns(b_len=nw.col("b").str.len_chars()) ) validation = ( pb.Validate( data=pb.load_dataset(dataset="small_table", tbl_type="polars") ) .col_vals_between( columns="a", left=3, right=6, pre=get_median_a ) .col_vals_eq( columns="b_len", value=9, pre=add_b_length_column ) .interrogate() ) validation ``` ### Numeric Comparisons (https://posit-dev.github.io/pointblank/demos/numeric-comparisons/) Perform comparisons of values in columns to fixed values ```python import pointblank as pb validation = ( pb.Validate( data=pb.load_dataset(dataset="small_table", tbl_type="polars") ) .col_vals_gt(columns="d", value=1000) # values in 'd' > 1000 .col_vals_lt(columns="d", value=10000) # values in 'd' < 10000 .col_vals_ge(columns="a", value=1) # values in 'a' >= 1 .col_vals_le(columns="c", value=5) # values in 'c' <= 5 .col_vals_ne(columns="a", value=7) # values in 'a' not equal to 7 .col_vals_between(columns="c", left=0, right=15) # 0 <= 'c' values <= 15 .interrogate() ) validation ``` ### Check the Schema of a Table (https://posit-dev.github.io/pointblank/demos/schema-check/) The schema of a table can be flexibly defined with `Schema` and verified with `col_schema_match()` ```python import pointblank as pb import polars as pl tbl = pl.DataFrame( { "a": ["apple", "banana", "cherry", "date"], "b": [1, 6, 3, 5], "c": [1.1, 2.2, 3.3, 4.4], } ) # Use the Schema class to define the column schema as loosely or rigorously as required schema = pb.Schema( columns=[ ("a", "String"), # Column 'a' has dtype 'String' ("b", ["Int", "Int64"]), # Column 'b' has dtype 'Int' or 'Int64' ("c", ) # Column 'c' follows 'b' but we don't specify a dtype here ] ) # Use the `col_schema_match()` validation method to perform the schema check validation = ( pb.Validate(data=tbl) .col_schema_match(schema=schema) .interrogate() ) validation ``` ### Set Membership (https://posit-dev.github.io/pointblank/demos/set-membership/) Perform validations that check whether values are part of a set (or *not* part of one) ```python import pointblank as pb validation = ( pb.Validate( data=pb.load_dataset(dataset="small_table", tbl_type="polars") ) .col_vals_in_set(columns="f", set=["low", "mid", "high"]) # part of this set .col_vals_not_in_set(columns="f", set=["zero", "infinity"]) # not part of this set .interrogate() ) validation ``` ### Using Parquet Data (https://posit-dev.github.io/pointblank/demos/using-parquet-data/) A Parquet dataset can be used for data validation, thanks to Ibis ```python import pointblank as pb import ibis game_revenue = 
ibis.read_parquet("data/game_revenue.parquet") validation = ( pb.Validate(data=game_revenue, label="Example using a Parquet dataset.") .col_vals_lt(columns="item_revenue", value=200) .col_vals_gt(columns="item_revenue", value=0) .col_vals_gt(columns="session_duration", value=5) .col_vals_in_set(columns="item_type", set=["iap", "ad"]) .col_vals_regex(columns="player_id", pattern=r"[A-Z]{12}\d{3}") .interrogate() ) validation ```