Data Validation
Validating data directly in the terminal with the Pointblank CLI offers a fast, scriptable, and repeatable way to check your data. This approach is especially useful for quick checks, CI/CD pipelines, and automation workflows, where you want immediate feedback and clear pass/fail results.
The CLI commands are designed for efficiency: you can run validations with a single line, integrate them easily into shell scripts or data pipelines, and benefit from clear, color-coded output that’s easy to interpret at a glance.
The pb validate
command lets you perform common validation checks directly on your data source with a simple command-line interface. This works well both for quick, one-off checks and for use in automated pipelines.
For more complex validation logic, the pb run
command serves as a runner for validation scripts written with the Pointblank Python API, allowing you to execute custom validation workflows from the command line.
pb validate
: Quick, One-Line Data Checks
The pb validate
command is your go-to for running common validation checks directly on your data source. It’s perfect for quick, one-off checks or for use in automated pipelines. You specify exactly which check you want to run using the --check
option, making your intent clear and your validation explicit.
Here’s how you construct a validation command:
pb validate worldcities.csv --check <check-name> [other options]
You always provide the data source first, then specify one or more checks with --check
. Each check can have its own options, such as --column
or --value
, depending on what you want to validate.
Checking for Duplicate and Complete Rows
To check for duplicate rows, use the rows-distinct
check:
pb validate worldcities.csv --check rows-distinct
The output shows you whether your data contains any duplicate rows, how many rows were checked, and if any duplicates were found. The color-coding of the results helps you quickly interpret the results, using green for pass and red for fail. Here, no duplicate rows were detected out of the 41K rows checked.
To check that every row is complete (i.e., no missing values in any column), use the rows-complete
check:
pb validate worldcities.csv --check rows-complete
With this check we see that the worldcities.csv
dataset has 739 rows containing at least one Null/missing value. And with any dataset, it’s easy to quickly spot if there are any rows with missing data using this command.
Checking for Nulls and Value Ranges
You can easily check for missing values in a column, or ensure that values fall within a certain range. Here’s how to check that all values in the population
column are not null:
pb validate worldcities.csv --check col-vals-not-null --column city_name
Perhaps surprisingly, we find that one row has a missing city name.
Let’s now check whether all values in the population
column are greater than zero:
pb validate worldcities.csv --check col-vals-gt --column population --value 0
With that we find that there are 741 rows where the population
value is not greater than 0 (note that this check also fails when cells are null or missing).
Multiple Checks in One Command
You can chain several checks together in a single command. This is handy for comprehensive data quality checks:
pb validate worldcities.csv --check rows-distinct --check col-vals-not-null --column city_name --check col-vals-gt --column population --value 0
Each check is shown one after the other in the terminal output, so you can review the result of each validation step individually as the command proceeds.
Seeing and Saving Failing Rows
If a check fails, you might want to see which rows caused the failure. Use the --show-extract
option to display failing rows right in the terminal:
pb validate worldcities.csv --check rows-complete --show-extract
Or, save the failing rows to a CSV file for further investigation:
pb validate worldcities.csv --check rows-complete --show-extract --write-extract incomplete_failing_rows
Note here in the output the additional lines stating that failing rows were saved to a folder (incomplete_failing_rows
) and, within that folder the step_01_rows_complete.csv
file was written. Using a folder for extracts is necessary in practice since there may be multiple validations defined in a pb validate
command.
Advanced Options and CI/CD Integration
- use
--exit-code
to make the command exit with a non-zero code if any check fails; useful for CI/CD pipelines - use
--limit
to control how many failing rows are shown or saved - use
--list-checks
to see all available validation checks and their options
pb validate worldcities.csv --check col-vals-not-null --column city_name --exit-code
pb run
: Custom Validation Workflows with Python
For more complex validation logic, use the pb run
command. This lets you execute a Python script containing Pointblank validation steps, combining the flexibility of the Python API with the convenience of the CLI.
You can always scaffold a template script using the pb make-template
command:
pb make-template my_validation.py
But for our example, we’ll elect to make our own worldcities_validation.py
file from scratch. It will:
- use the
worldcities.csv
file - apply two thresholds (one for ‘warning’, another for ‘error’)
- have six validation steps
Here’s what it looks like:
import pointblank as pb
= (
validation
pb.Validate(="worldcities.csv",
data=pb.Thresholds(
thresholds=1, # 1 failure
warning=0.05, # 5% of rows failing
error
),
)
.col_schema_match(=pb.Schema(
schema=[
columns"city_name", "object"),
("latitude", "float64"),
("longitude", "float64"),
("country", "object"),
("population", "float64"),
(
]
),
)="city_name")
.col_vals_not_null(columns="population")
.col_vals_not_null(columns="population", value=0, na_pass=True)
.col_vals_gt(columns="latitude", left=-90, right=90)
.col_vals_between(columns="longitude", left=-180, right=180)
.col_vals_between(columns
.interrogate() )
Now, we’ll run the .py script from the terminal:
pb run worldcities_validation.py
You’ll see a summary table that lists all of the steps and their results and you can include as many steps and as much logic as you need.
Output Options
You could save the validation report as HTML or JSON (or both) for the purposes of sharing or for automation:
pb run worldcities_validation.py --output-html report.html --output-json report.json
There are also the options to produce extracts (subset of failing rows) with --show-extract
or --write-extract
(just like with pb validate
). Let’s do both in the following example:
pb run worldcities_validation.py --show-extract --write-extract worldcities_failures
This shows a preview of each extract for those validation steps where extracts were produced (steps 2, 3, and 4). Individual CSV files with extracted rows for those steps were written to the worldcities_failures
directory.
Controlling Failure Behavior
It’s possible to use the --fail-on
option to control when the command should exit with an error, based on the severity of validation failures. This is especially useful for automated workflows and CI/CD pipelines.
Let’s try that with our worldcities_validation.py
validation, which we’ve seen exeeds the ‘warning’ in steps 2, 3, and 4:
pb run worldcities_validation.py --fail-on warning
Notice the final line states Exiting with error due to warning, error, or critical validation failures
. Because we applied --fail-on warning
, any presence of `warning’ (or higher levels such as ‘error’ or ‘critical’) will yield a non-zero exit code that should stop a pipeline process. We can prove this by running the following lines in the terminal
pb run worldcities_validation.py --fail-on warning > /dev/null 2>&1
echo $?
which returns 1
.
Wrapping Up
Pointblank’s CLI gives you powerful tools for validating your data, whether you need a quick check or a custom workflow. Use pb validate
for fast, one-liner checks and pb run
for more advanced, scriptable validation logic. With clear output and flexible options, you can catch data issues early and keep your workflows running smoothly.