Validate

Workflow for defining a set of validations on a table and interrogating for results.

USAGE

Validate(
    data,
    tbl_name=None,
    label=None,
    thresholds=None,
    actions=None,
    final_actions=None,
    brief=None,
    lang=None,
    locale=None,
)

The Validate class is used for defining a set of validation steps on a table and interrogating the table with the validation plan. This class is the main entry point for the data quality reporting workflow. The overall aim of this workflow is to generate comprehensive reporting information to assess the level of data quality for a target table.

We can supply as many validation steps as needed, and having a large number of them should increase the validation coverage for a given table. The validation methods (e.g., col_vals_gt(), col_vals_between(), etc.) translate to discrete validation steps, where each step will be sequentially numbered (useful when viewing the reporting data). This process of calling validation methods is known as developing a validation plan.

The validation methods, when called, are merely instructions up to the point the concluding interrogate() method is called. That kicks off the process of acting on the validation plan by querying the target table and getting reporting results for each step. Once the interrogation process is complete, we can say that the workflow now has reporting information. We can then extract useful information from the reporting data to understand the quality of the table. Printing the Validate object (or using the get_tabular_report() method) will return a table with the results of the interrogation, and get_sundered_data() allows for the splitting of the table based on passing and failing rows.
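
As a minimal sketch of this workflow (using the bundled small_table dataset and two simple steps):

import pointblank as pb

# Develop a validation plan, then act on it with `interrogate()`
validation = (
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_vals_gt(columns="a", value=0)   # step 1
    .col_vals_not_null(columns="b")      # step 2
    .interrogate()                       # query the table and collect results
)

# The reporting table summarizing the interrogation results
validation.get_tabular_report()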

Parameters

data : FrameT | Any

The table to validate, which could be a DataFrame object, an Ibis table object, a CSV file path, a Parquet file path, a GitHub URL pointing to a CSV or Parquet file, or a database connection string. When providing a CSV or Parquet file path (as a string or pathlib.Path object), the file will be automatically loaded using an available DataFrame library (Polars or Pandas). Parquet input also supports glob patterns, directories containing .parquet files, and Spark-style partitioned datasets. GitHub URLs are automatically transformed to raw content URLs and downloaded. Connection strings enable direct database access via Ibis with optional table specification using the ::table_name suffix. Read the Supported Input Table Types section for details on the supported table types.

tbl_name : str | None = None

An optional name to assign to the input table object. If no value is provided, a name will be generated based on whatever information is available. This table name will be displayed in the header area of the tabular report.

label : str | None = None

An optional label for the validation plan. If no value is provided, a label will be generated based on the current system date and time. Markdown can be used here to make the label more visually appealing (it will appear in the header area of the tabular report).

thresholds : int | float | bool | tuple | dict | Thresholds | None = None

Generate threshold failure levels so that all validation steps can report and react accordingly when exceeding the set levels. The thresholds are set at the global level and can be overridden at the validation step level (each validation step has its own thresholds= parameter). The default is None, which means that no thresholds will be set. Look at the Thresholds section for information on how to set threshold levels.

actions : Actions | None = None

The actions to take when validation steps meet or exceed any set threshold levels. These actions are paired with the threshold levels and are executed during the interrogation process when there are exceedances. The actions are executed right after each step is evaluated. Such actions should be provided in the form of an Actions object. If None then no global actions will be set. View the Actions section for information on how to set actions.

final_actions : FinalActions | None = None

The actions to take when the validation process is complete and the final results are available. This is useful for sending notifications or reporting the overall status of the validation process. The final actions are executed after all validation steps have been processed and the results have been collected. They are not tied to any threshold levels; they are executed regardless of the validation results. Such actions should be provided in the form of a FinalActions object. If None then no finalizing actions will be set. Please see the Actions section for information on how to set final actions.

brief : str | bool | None = None

A global setting for briefs, which are optional brief descriptions for validation steps (they will be displayed in the reporting table). For such a global setting, templating elements like "{step}" (to insert the step number) or "{auto}" (to include an automatically generated brief) are useful. If True then each brief will be automatically generated. If None (the default) then briefs aren’t globally set.

lang : str | None = None

The language to use for various reporting elements. By default, None will select English ("en"), but other options include French ("fr"), German ("de"), Italian ("it"), Spanish ("es"), and several more. Have a look at the Reporting Languages section for the full list of supported languages and information on how the language setting is utilized.

locale : str | None = None

An optional locale ID to use for formatting values in the reporting table according to the locale’s rules. Examples include "en-US" for English (United States) and "fr-FR" for French (France). More simply, this can be a language identifier without a designation of territory, like "es" for Spanish.

Returns

Validate

A Validate object with the table and validations to be performed.

Supported Input Table Types

The data= parameter can be given any of the following table types:

  • Polars DataFrame ("polars")
  • Pandas DataFrame ("pandas")
  • DuckDB table ("duckdb")*
  • MySQL table ("mysql")*
  • PostgreSQL table ("postgresql")*
  • SQLite table ("sqlite")*
  • Microsoft SQL Server table ("mssql")*
  • Snowflake table ("snowflake")*
  • Databricks table ("databricks")*
  • PySpark table ("pyspark")*
  • BigQuery table ("bigquery")*
  • Parquet table ("parquet")*
  • CSV files (string path or pathlib.Path object with .csv extension)
  • Parquet files (string path, pathlib.Path object, glob pattern, directory with .parquet extension, or partitioned dataset)
  • Database connection strings (URI format with optional table specification)

The table types marked with an asterisk need to be prepared as Ibis tables (with type of ibis.expr.types.relations.Table). Furthermore, the use of Validate with such tables requires the Ibis library v9.5.0 and above to be installed. If the input table is a Polars or Pandas DataFrame, the Ibis library is not required.

To use a CSV file, ensure that a string or pathlib.Path object with a .csv extension is provided. The file will be automatically detected and loaded using the best available DataFrame library. The loading preference is Polars first, then Pandas as a fallback.
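For example (the file path here is illustrative; both forms behave the same):

from pathlib import Path
import pointblank as pb

# A string path or a `pathlib.Path` object with a `.csv` extension both work
validation = pb.Validate(data="data/my_table.csv")
validation = pb.Validate(data=Path("data/my_table.csv"))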

Connection strings follow database URL formats and must also specify a table using the ::table_name suffix. Examples include:

"duckdb:///path/to/database.ddb::table_name"
"sqlite:///path/to/database.db::table_name"
"postgresql://user:password@localhost:5432/database::table_name"
"mysql://user:password@localhost:3306/database::table_name"
"bigquery://project/dataset::table_name"
"snowflake://user:password@account/database/schema::table_name"

When using connection strings, the Ibis library with the appropriate backend driver is required.

Thresholds

The thresholds= parameter is used to set the failure-condition levels for all validation steps. They are set here at the global level but can be overridden at the validation step level (each validation step has its own local thresholds= parameter).

There are three threshold levels: ‘warning’, ‘error’, and ‘critical’. The threshold values can either be set as a proportion of failing test units (a value between 0 and 1) or as an absolute number of failing test units (an integer that’s 1 or greater).

Thresholds can be defined using one of these input schemes:

  1. use the Thresholds class (the most direct way to create thresholds)
  2. provide a tuple of 1-3 values, where position 0 is the ‘warning’ level, position 1 is the ‘error’ level, and position 2 is the ‘critical’ level
  3. create a dictionary of 1-3 value entries, where the valid keys are ‘warning’, ‘error’, and ‘critical’
  4. provide a single integer/float value denoting the absolute number or fraction of failing test units for the ‘warning’ level only
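
As a sketch, the following specifications of thresholds= are equivalent under schemes 1-3 (scheme 4 sets only the ‘warning’ level):

import pointblank as pb

pb.Thresholds(warning=0.10, error=0.25, critical=0.35)   # 1: the Thresholds class
(0.10, 0.25, 0.35)                                       # 2: tuple (warning, error, critical)
{"warning": 0.10, "error": 0.25, "critical": 0.35}       # 3: dict with named levels
0.10                                                     # 4: ‘warning’ level only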

If the number of failing test units for a validation step exceeds a set threshold, the validation step will be marked as ‘warning’, ‘error’, or ‘critical’. Not all of the threshold levels need to be set; you’re free to set any combination of them.

Aside from reporting failure conditions, thresholds can be used to determine the actions to take for each level of failure (using the actions= parameter).

Actions

The actions= and final_actions= parameters provide mechanisms to respond to validation results. These actions can be used to notify users of validation failures, log issues, or trigger other processes when problems are detected.

Step Actions

The actions= parameter allows you to define actions that are triggered when validation steps exceed specific threshold levels (warning, error, or critical). These actions are executed during the interrogation process, right after each step is evaluated.

Step actions should be provided using the Actions class, which lets you specify different actions for different severity levels:

# Define an action that logs a message when warning threshold is exceeded
def log_warning():
    metadata = pb.get_action_metadata()
    print(f"WARNING: Step {metadata['step']} failed with type {metadata['type']}")

# Define actions for different threshold levels
actions = pb.Actions(
    warning = log_warning,
    error = lambda: send_email("Error in validation"),
    critical = "CRITICAL FAILURE DETECTED"
)

# Use in Validate
validation = pb.Validate(
    data=my_data,
    actions=actions  # Global actions for all steps
)

You can also provide step-specific actions in individual validation methods:

validation.col_vals_gt(
    columns="revenue",
    value=0,
    actions=pb.Actions(warning=log_warning)  # Only applies to this step
)

Step actions have access to step-specific context through the get_action_metadata() function, which provides details about the current validation step that triggered the action.

Final Actions

The final_actions= parameter lets you define actions that execute after all validation steps have completed. These are useful for providing summaries, sending notifications based on overall validation status, or performing cleanup operations.

Final actions should be provided using the FinalActions class:

def send_report():
    summary = pb.get_validation_summary()
    if summary["status"] == "CRITICAL":
        send_alert_email(
            subject=f"CRITICAL validation failures in {summary['table_name']}",
            body=f"{summary['critical_steps']} steps failed with critical severity."
        )

validation = pb.Validate(
    data=my_data,
    final_actions=pb.FinalActions(send_report)
)

Final actions have access to validation-wide summary information through the get_validation_summary() function, which provides a comprehensive overview of the entire validation process.

The combination of step actions and final actions provides a flexible system for responding to data quality issues at both the individual step level and the overall validation level.

Reporting Languages

Various pieces of reporting in Pointblank can be localized to a specific language. This is done by setting the lang= parameter in Validate. Any of the following languages can be used (just provide the language code):

  • English ("en")
  • French ("fr")
  • German ("de")
  • Italian ("it")
  • Spanish ("es")
  • Portuguese ("pt")
  • Dutch ("nl")
  • Swedish ("sv")
  • Danish ("da")
  • Norwegian Bokmål ("nb")
  • Icelandic ("is")
  • Finnish ("fi")
  • Polish ("pl")
  • Czech ("cs")
  • Romanian ("ro")
  • Greek ("el")
  • Russian ("ru")
  • Turkish ("tr")
  • Arabic ("ar")
  • Hindi ("hi")
  • Simplified Chinese ("zh-Hans")
  • Traditional Chinese ("zh-Hant")
  • Japanese ("ja")
  • Korean ("ko")
  • Vietnamese ("vi")

Automatically generated briefs (produced by using brief=True or brief="...{auto}...") will be written in the selected language. The language setting will also be used when generating the validation report table through get_tabular_report() (or printing the Validate object in a notebook environment).
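
For example, a report can be produced in French, with values formatted according to the fr-FR locale (a sketch using the bundled small_table dataset):

import pointblank as pb

validation_fr = pb.Validate(
    data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
    lang="fr",       # report text (and auto-generated briefs) in French
    locale="fr-FR",  # value formatting per the French (France) locale
)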

Examples


Creating a validation plan and interrogating

Let’s walk through a data quality analysis of an extremely small table. It’s actually called "small_table" and it’s accessible through the load_dataset() function.

import pointblank as pb

# Load the `small_table` dataset
small_table = pb.load_dataset(dataset="small_table", tbl_type="polars")

# Preview the table
pb.preview(small_table)
Polars · Rows: 13 · Columns: 8

     date_time            date        a      b          c     d        e        f
     Datetime             Date        Int64  String     Int64 Float64  Boolean  String
  1  2016-01-04 11:00:00  2016-01-04  2      1-bcd-345  3     3423.29  True     high
  2  2016-01-04 00:32:00  2016-01-04  3      5-egh-163  8     9999.99  True     low
  3  2016-01-05 13:32:00  2016-01-05  6      8-kdg-938  3     2343.23  True     high
  4  2016-01-06 17:23:00  2016-01-06  2      5-jdo-903  None  3892.4   False    mid
  5  2016-01-09 12:36:00  2016-01-09  8      3-ldm-038  7     283.94   True     low
  …
  9  2016-01-20 04:30:00  2016-01-20  3      5-bce-642  9     837.93   False    high
 10  2016-01-20 04:30:00  2016-01-20  3      5-bce-642  9     837.93   False    high
 11  2016-01-26 20:07:00  2016-01-26  4      2-dmx-010  7     833.98   True     low
 12  2016-01-28 02:51:00  2016-01-28  2      7-dmx-010  8     108.34   False    low
 13  2016-01-30 11:23:00  2016-01-30  1      3-dka-303  None  2230.09  True     high

We ought to think about what’s tolerable in terms of data quality, so let’s designate proportional failure thresholds for the ‘warning’, ‘error’, and ‘critical’ states. This can be done by using the Thresholds class.

thresholds = pb.Thresholds(warning=0.10, error=0.25, critical=0.35)

Now, we use the Validate class and give it the thresholds object (which serves as a default for all validation steps but can be overridden). The static thresholds provided in thresholds= will make the reporting a bit more useful. We also need to provide a target table and we’ll use small_table for this.

validation = (
    pb.Validate(
        data=small_table,
        tbl_name="small_table",
        label="`Validate` example.",
        thresholds=thresholds
    )
)

Then, as with any Validate object, we can add steps to the validation plan by using as many validation methods as we want. To conclude the process (and actually query the data table), we use the interrogate() method.

validation = (
    validation
    .col_vals_gt(columns="d", value=100)
    .col_vals_le(columns="c", value=5)
    .col_vals_between(columns="c", left=3, right=10, na_pass=True)
    .col_vals_regex(columns="b", pattern=r"[0-9]-[a-z]{3}-[0-9]{3}")
    .col_exists(columns=["date", "date_time"])
    .interrogate()
)

The validation object can be printed as a reporting table.

validation
Pointblank Validation
`Validate` example.
Polars · small_table · WARNING 0.1 · ERROR 0.25 · CRITICAL 0.35

STEP                   COLUMNS    VALUES                   UNITS  PASS       FAIL
1  col_vals_gt()       d          100                      13     13 (1.00)  0 (0.00)
2  col_vals_le()       c          5                        13     5 (0.38)   8 (0.62)
3  col_vals_between()  c          [3, 10]                  13     12 (0.92)  1 (0.08)
4  col_vals_regex()    b          [0-9]-[a-z]{3}-[0-9]{3}  13     13 (1.00)  0 (0.00)
5  col_exists()        date       -                        1      1 (1.00)   0 (0.00)
6  col_exists()        date_time  -                        1      1 (1.00)   0 (0.00)

Interrogation started: 2025-06-22 01:24:55 UTC · duration: < 1 s · completed: 2025-06-22 01:24:55 UTC

The report could be further customized by using the get_tabular_report() method, which contains options for modifying the display of the table.
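
For instance, assuming the title= option of that method, a custom title could be set like this (a sketch):

validation.get_tabular_report(title="Data Quality Report for `small_table`")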

Adding briefs

Briefs are short descriptions of the validation steps. While they can be set for each step individually, they can also be set globally. The global setting is done by using the brief= argument in Validate. The global setting can be as simple as True to have automatically-generated briefs for each step. Alternatively, we can use templating elements like "{step}" (to insert the step number) or "{auto}" (to include an automatically generated brief). Here’s an example of a global setting for briefs:

validation_2 = (
    pb.Validate(
        data=pb.load_dataset(),
        tbl_name="small_table",
        label="Validation example with briefs",
        brief="Step {step}: {auto}",
    )
    .col_vals_gt(columns="d", value=100)
    .col_vals_between(columns="c", left=3, right=10, na_pass=True)
    .col_vals_regex(
        columns="b",
        pattern=r"[0-9]-[a-z]{3}-[0-9]{3}",
        brief="Regex check for column {col}"
    )
    .interrogate()
)

validation_2
Pointblank Validation
Validation example with briefs
Polars · small_table

STEP                   COLUMNS  VALUES                   UNITS  PASS       FAIL
1  col_vals_gt()       d        100                      13     13 (1.00)  0 (0.00)
   Step 1: Expect that values in d should be > 100.
2  col_vals_between()  c        [3, 10]                  13     12 (0.92)  1 (0.08)
   Step 2: Expect that values in c should be between 3 and 10.
3  col_vals_regex()    b        [0-9]-[a-z]{3}-[0-9]{3}  13     13 (1.00)  0 (0.00)
   Regex check for column b

Interrogation started: 2025-06-22 01:24:55 UTC · duration: < 1 s · completed: 2025-06-22 01:24:55 UTC

We see the text of the briefs appear in the STEP column of the reporting table. Furthermore, the global brief’s template ("Step {step}: {auto}") is applied to all steps except for the final step, where the step-level brief= argument provided an override.

If you should want to cancel the globally-defined brief for one or more validation steps, you can set brief=False in those particular steps.
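
For instance, a sketch where one step opts out of the global brief:

validation_no_brief = (
    pb.Validate(
        data=pb.load_dataset(dataset="small_table", tbl_type="polars"),
        brief="Step {step}: {auto}",
    )
    .col_vals_gt(columns="d", value=100)      # gets the global brief template
    .col_exists(columns="date", brief=False)  # global brief canceled for this step
    .interrogate()
)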

Post-interrogation methods

The Validate class has a number of post-interrogation methods that can be used to extract useful information from the validation results. For example, the get_data_extracts() method can be used to get the data extracts for each validation step.

validation_2.get_data_extracts()
{1: shape: (0, 9)
 ┌───────────┬──────────────┬──────┬─────┬───┬─────┬─────┬──────┬─────┐
 │ _row_num_ ┆ date_time    ┆ date ┆ a   ┆ … ┆ c   ┆ d   ┆ e    ┆ f   │
 │ ---       ┆ ---          ┆ ---  ┆ --- ┆   ┆ --- ┆ --- ┆ ---  ┆ --- │
 │ u32       ┆ datetime[μs] ┆ date ┆ i64 ┆   ┆ i64 ┆ f64 ┆ bool ┆ str │
 ╞═══════════╪══════════════╪══════╪═════╪═══╪═════╪═════╪══════╪═════╡
 └───────────┴──────────────┴──────┴─────┴───┴─────┴─────┴──────┴─────┘,
 2: shape: (1, 9)
 ┌───────────┬─────────────────────┬────────────┬─────┬───┬─────┬─────────┬───────┬─────┐
 │ _row_num_ ┆ date_time           ┆ date       ┆ a   ┆ … ┆ c   ┆ d       ┆ e     ┆ f   │
 │ ---       ┆ ---                 ┆ ---        ┆ --- ┆   ┆ --- ┆ ---     ┆ ---   ┆ --- │
 │ u32       ┆ datetime[μs]        ┆ date       ┆ i64 ┆   ┆ i64 ┆ f64     ┆ bool  ┆ str │
 ╞═══════════╪═════════════════════╪════════════╪═════╪═══╪═════╪═════════╪═══════╪═════╡
 │ 8         ┆ 2016-01-17 11:27:00 ┆ 2016-01-17 ┆ 4   ┆ … ┆ 2   ┆ 1035.64 ┆ false ┆ low │
 └───────────┴─────────────────────┴────────────┴─────┴───┴─────┴─────────┴───────┴─────┘,
 3: shape: (0, 9)
 ┌───────────┬──────────────┬──────┬─────┬───┬─────┬─────┬──────┬─────┐
 │ _row_num_ ┆ date_time    ┆ date ┆ a   ┆ … ┆ c   ┆ d   ┆ e    ┆ f   │
 │ ---       ┆ ---          ┆ ---  ┆ --- ┆   ┆ --- ┆ --- ┆ ---  ┆ --- │
 │ u32       ┆ datetime[μs] ┆ date ┆ i64 ┆   ┆ i64 ┆ f64 ┆ bool ┆ str │
 ╞═══════════╪══════════════╪══════╪═════╪═══╪═════╪═════╪══════╪═════╡
 └───────────┴──────────────┴──────┴─────┴───┴─────┴─────┴──────┴─────┘}

We can also view step reports for each validation step using the get_step_report() method. This method adapts to the type of validation step and shows the relevant information for a step’s validation.

validation_2.get_step_report(i=2)
Report for Validation Step 2
ASSERTION: 3 ≤ c ≤ 10
1 / 13 TEST UNIT FAILURES IN COLUMN 5

EXTRACT OF ALL 1 ROWS WITH TEST UNIT FAILURES:

    date_time            date        a      b          c     d        e        f
    Datetime             Date        Int64  String     Int64 Float64  Boolean  String
 8  2016-01-17 11:27:00  2016-01-17  4      5-boe-639  2     1035.64  False    low

The Validate class also has a method for getting the sundered data, which is the data that passed or failed the validation steps. This can be done using the get_sundered_data() method.

pb.preview(validation_2.get_sundered_data())
Polars · Rows: 12 · Columns: 8

     date_time            date        a      b          c     d        e        f
     Datetime             Date        Int64  String     Int64 Float64  Boolean  String
  1  2016-01-04 11:00:00  2016-01-04  2      1-bcd-345  3     3423.29  True     high
  2  2016-01-04 00:32:00  2016-01-04  3      5-egh-163  8     9999.99  True     low
  3  2016-01-05 13:32:00  2016-01-05  6      8-kdg-938  3     2343.23  True     high
  4  2016-01-06 17:23:00  2016-01-06  2      5-jdo-903  None  3892.4   False    mid
  5  2016-01-09 12:36:00  2016-01-09  8      3-ldm-038  7     283.94   True     low
  …
  8  2016-01-20 04:30:00  2016-01-20  3      5-bce-642  9     837.93   False    high
  9  2016-01-20 04:30:00  2016-01-20  3      5-bce-642  9     837.93   False    high
 10  2016-01-26 20:07:00  2016-01-26  4      2-dmx-010  7     833.98   True     low
 11  2016-01-28 02:51:00  2016-01-28  2      7-dmx-010  8     108.34   False    low
 12  2016-01-30 11:23:00  2016-01-30  1      3-dka-303  None  2230.09  True     high

The sundered data is a DataFrame that contains the rows that passed or failed the validation. The default behavior is to return the rows that passed the validation, as shown above (the single failing row has been removed).
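
To get the failing rows instead, the type= argument of get_sundered_data() can be set to "fail":

# Only the rows that failed one or more row-based validation steps
pb.preview(validation_2.get_sundered_data(type="fail"))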

Working with CSV Files

The Validate class can directly accept CSV file paths, making it easy to validate data stored in CSV files without manual loading:

# Get a path to a CSV file from the package data
csv_path = pb.get_data_path("global_sales", "csv")

validation_3 = (
    pb.Validate(
        data=csv_path,
        label="CSV validation example"
    )
    .col_exists(["customer_id", "product_id", "revenue"])
    .col_vals_not_null(["customer_id", "product_id"])
    .col_vals_gt(columns="revenue", value=0)
    .interrogate()
)

validation_3
Pointblank Validation
CSV validation example
Polars

STEP                    COLUMNS      VALUES  UNITS  PASS          FAIL
1  col_exists()         customer_id  -       1      1 (1.00)      0 (0.00)
2  col_exists()         product_id   -       1      1 (1.00)      0 (0.00)
3  col_exists()         revenue      -       1      1 (1.00)      0 (0.00)
4  col_vals_not_null()  customer_id  -       50.0K  49.7K (0.99)  334 (0.01)
5  col_vals_not_null()  product_id   -       50.0K  49.7K (0.99)  335 (0.01)
6  col_vals_gt()        revenue      0       50.0K  50.0K (1.00)  0 (0.00)

Interrogation started: 2025-06-22 01:24:56 UTC · duration: < 1 s · completed: 2025-06-22 01:24:56 UTC

You can also use a Path object to specify the CSV file. Here’s an example of how to do that:

from pathlib import Path

csv_file = Path(pb.get_data_path("game_revenue", "csv"))

validation_4 = (
    pb.Validate(data=csv_file, label="Game Revenue Validation")
    .col_exists(["player_id", "session_id", "item_name"])
    .col_vals_regex(
        columns="session_id",
        pattern=r"[A-Z0-9]{8}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{12}"
    )
    .col_vals_gt(columns="item_revenue", value=0, na_pass=True)
    .interrogate()
)

validation_4
Pointblank Validation
Game Revenue Validation
Polars

STEP                 COLUMNS       VALUES                                                         UNITS  PASS         FAIL
1  col_exists()      player_id     -                                                              1      1 (1.00)     0 (0.00)
2  col_exists()      session_id    -                                                              1      1 (1.00)     0 (0.00)
3  col_exists()      item_name     -                                                              1      1 (1.00)     0 (0.00)
4  col_vals_regex()  session_id    [A-Z0-9]{8}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{12}   2000   0 (0.00)     2000 (1.00)
5  col_vals_gt()     item_revenue  0                                                              2000   2000 (1.00)  0 (0.00)

Interrogation started: 2025-06-22 01:24:56 UTC · duration: < 1 s · completed: 2025-06-22 01:24:56 UTC

The CSV loading is automatic, so when a string or Path with a .csv extension is provided, Pointblank will automatically load the file using the best available DataFrame library (Polars preferred, Pandas as fallback). The loaded data can then be used with all validation methods just like any other supported table type.

Working with Parquet Files

The Validate class can directly accept Parquet files and datasets in various formats. The following examples illustrate how to validate Parquet files:

# Single Parquet file from package data
parquet_path = pb.get_data_path("nycflights", "parquet")

validation_5 = (
    pb.Validate(
        data=parquet_path,
        tbl_name="NYC Flights Data"
    )
    .col_vals_not_null(["carrier", "origin", "dest"])
    .col_vals_gt(columns="distance", value=0)
    .interrogate()
)

validation_5
Pointblank Validation
2025-06-22|01:24:56
Polars · NYC Flights Data

STEP                    COLUMNS   VALUES  UNITS  PASS         FAIL
1  col_vals_not_null()  carrier   -       337K   337K (1.00)  0 (0.00)
2  col_vals_not_null()  origin    -       337K   337K (1.00)  0 (0.00)
3  col_vals_not_null()  dest      -       337K   337K (1.00)  0 (0.00)
4  col_vals_gt()        distance  0       337K   337K (1.00)  0 (0.00)

Interrogation started: 2025-06-22 01:24:56 UTC · duration: < 1 s · completed: 2025-06-22 01:24:56 UTC

You can also use glob patterns and directories. Here are some examples showing how to:

  1. load multiple Parquet files
  2. load a Parquet-containing directory
  3. load a partitioned Parquet dataset

# Multiple Parquet files with glob patterns
validation_6 = pb.Validate(data="data/sales_*.parquet")

# Directory containing Parquet files
validation_7 = pb.Validate(data="parquet_data/")

# Partitioned Parquet dataset
validation_8 = (
    pb.Validate(data="sales_data/")  # Contains year=2023/quarter=Q1/region=US/sales.parquet
    .col_exists(["transaction_id", "amount", "year", "quarter", "region"])
    .interrogate()
)

When you point to a directory that contains a partitioned Parquet dataset (with subdirectories like year=2023/quarter=Q1/region=US/), Pointblank will automatically:

  • discover all Parquet files recursively
  • extract partition column values from directory paths
  • add partition columns to the final DataFrame
  • combine all partitions into a single table for validation

Both Polars and Pandas handle partitioned datasets natively, so this works seamlessly with either DataFrame library. The loading preference is Polars first, then Pandas as a fallback.

Working with Database Connection Strings

The Validate class supports database connection strings for direct validation of database tables. Connection strings must specify a table using the ::table_name suffix:

# Get path to a DuckDB database file from package data
duckdb_path = pb.get_data_path("game_revenue", "duckdb")

validation_9 = (
    pb.Validate(
        data=f"duckdb:///{duckdb_path}::game_revenue",
        label="DuckDB Game Revenue Validation"
    )
    .col_exists(["player_id", "session_id", "item_revenue"])
    .col_vals_gt(columns="item_revenue", value=0)
    .interrogate()
)

validation_9
Pointblank Validation
DuckDB Game Revenue Validation
DuckDB

STEP              COLUMNS       VALUES  UNITS  PASS         FAIL
1  col_exists()   player_id     -       1      1 (1.00)     0 (0.00)
2  col_exists()   session_id    -       1      1 (1.00)     0 (0.00)
3  col_exists()   item_revenue  -       1      1 (1.00)     0 (0.00)
4  col_vals_gt()  item_revenue  0       2000   2000 (1.00)  0 (0.00)

Interrogation started: 2025-06-22 01:24:57 UTC · duration: < 1 s · completed: 2025-06-22 01:24:57 UTC

For comprehensive documentation on supported connection string formats, error handling, and installation requirements, see the connect_to_table() function. This function handles all the connection logic and provides helpful error messages when table specifications are missing or backend dependencies are not installed.
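
As a sketch of using that function directly (the path and table name below are illustrative), the connection can be established first and the resulting table passed to Validate:

import pointblank as pb

# Connect to the table, then validate it like any other supported table
tbl = pb.connect_to_table("duckdb:///path/to/database.ddb::table_name")
validation = pb.Validate(data=tbl)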