Data Validation and Automation
Pointblank’s CLI makes it easy to validate your data directly from the terminal. This is ideal for quick checks, CI/CD pipelines, and automation workflows. The pb run command serves as a runner for validation scripts written with the Pointblank Python API, allowing you to execute more complex validation logic from the command line.
Supported Data Sources
You can validate a wide variety of data sources using the CLI:
- CSV files: single files, glob patterns
- Parquet files: single files, directories, partitioned datasets
- GitHub URLs: for CSV/Parquet files (standard or raw URLs)
- database tables: via connection strings
- built-in datasets: provided by Pointblank
Quick Reference for the Data Validation Commands
Command | Purpose |
---|---|
pb validate | Run validation checks on your data |
pb run | Run a Python validation script from the CLI |
pb make-template | Generate a validation script template |
pb validate: Run Validation Checks
Use pb validate to perform one or more validation checks on your data source. Here’s the basic usage pattern:
pb validate [DATA_SOURCE] --check [CHECK_TYPE] [OPTIONS]
Here are the supported checks and the required options in parentheses:
- rows-distinct: check for duplicate rows (default)
- rows-complete: check for missing values in any column
- col-exists: check if a column exists (--column)
- col-vals-not-null: check if a column has no null values (--column)
- col-vals-gt: column values greater than a value (--column, --value)
- col-vals-ge: column values greater than or equal to a value (--column, --value)
- col-vals-lt: column values less than a value (--column, --value)
- col-vals-le: column values less than or equal to a value (--column, --value)
- col-vals-in-set: column values must be in a set (--column, --set)
Here are a few examples:
# Check for duplicate rows (default)
pb validate data.csv
# Check if all values in 'age' are not null
pb validate data.csv --check col-vals-not-null --column age
# Check if all values in 'score' are greater than 50
pb validate data.csv --check col-vals-gt --column score --value 50
# Check if 'status' values are in a set
pb validate data.csv --check col-vals-in-set --column status --set "active,inactive,pending"
Several consecutive checks can be performed in one command! To do this, use --check multiple times (along with the check type and its required options).
# Perform multiple checks in one command
pb validate data.csv --check rows-distinct --check col-vals-not-null --column id
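When commands like these are generated by a script rather than typed by hand, the pairing of each check with its required options can be sketched in Python. This is a minimal illustration, not part of Pointblank: build_validate_command and REQUIRED_OPTIONS are hypothetical names, the option table is transcribed from the list of checks above, and placing each check's options directly after its --check flag is an assumption based on the multi-check example.

```python
import shlex

# Required options per check, transcribed from the list of supported checks.
REQUIRED_OPTIONS = {
    "rows-distinct": (),
    "rows-complete": (),
    "col-exists": ("column",),
    "col-vals-not-null": ("column",),
    "col-vals-gt": ("column", "value"),
    "col-vals-ge": ("column", "value"),
    "col-vals-lt": ("column", "value"),
    "col-vals-le": ("column", "value"),
    "col-vals-in-set": ("column", "set"),
}


def build_validate_command(data_source, checks):
    """Return the argv list for one `pb validate` call (hypothetical helper).

    `checks` is a list of (check_type, options) pairs, where `options` is a
    dict such as {"column": "score", "value": 50}.
    """
    argv = ["pb", "validate", data_source]
    for check_type, options in checks:
        if check_type not in REQUIRED_OPTIONS:
            raise ValueError(f"unknown check type: {check_type}")
        argv += ["--check", check_type]
        # Assumption: each check's options follow its --check flag directly.
        for name in REQUIRED_OPTIONS[check_type]:
            if name not in options:
                raise ValueError(f"{check_type} requires --{name}")
            argv += [f"--{name}", str(options[name])]
    return argv


command = build_validate_command(
    "data.csv",
    [
        ("rows-distinct", {}),
        ("col-vals-not-null", {"column": "id"}),
    ],
)
# Prints: pb validate data.csv --check rows-distinct --check col-vals-not-null --column id
print(shlex.join(command))
```

Returning a list instead of a single string means the result can be passed to subprocess.run without shell quoting concerns, and a missing required option is caught before the CLI is invoked.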
There are several useful options:
- --show-extract: show failing rows if validation fails
- --write-extract TEXT: save failing rows to a directory as CSV
- --limit INTEGER: limit the number of failing rows shown/saved (default: 10)
- --exit-code: exit with non-zero code if validation fails (for CI/CD)
- --list-checks: list all available validation checks
pb run: Run Python Validation Scripts
Use pb run to execute a Python script containing Pointblank validation logic. This is useful for more complex validations or automation.
pb run validation_script.py
Here are the options:
- --data [DATA_SOURCE]: override the data source in your script (available as cli_data)
- --output-html file.html: save validation report as HTML
- --output-json file.json: save validation summary as JSON
- --show-extract: show failing rows for failed steps
- --write-extract TEXT: save failing rows for each step as CSVs in a directory
- --limit INTEGER: limit the number of failing rows shown/saved
- --fail-on [critical|error|warning|any]: exit with error if any step meets/exceeds this severity
Here’s an example where we:
- override the input data in the script
- output the validation report table to a file
- signal failure (i.e., exit with non-zero code) if any ‘error’ threshold is met
pb run my_validation.py --data data.csv --output-html report.html --fail-on error
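The "meets/exceeds this severity" rule behind --fail-on can be modelled with a few lines of Python. This is only a sketch of the behaviour as described here, not Pointblank's implementation: should_fail is a hypothetical function, the ordering warning < error < critical is assumed, and treating any as "the lowest level or higher" is likewise an assumption.

```python
# Severity levels in ascending order (assumed ordering of the --fail-on values).
SEVERITY_ORDER = ["warning", "error", "critical"]


def should_fail(step_severities, fail_on):
    """Return True if any step meets or exceeds the `fail_on` level.

    `step_severities` holds the highest level each step reached, or None
    for a step that crossed no threshold (hypothetical representation).
    """
    if fail_on == "any":
        # Assumption: "any" behaves like the lowest severity level.
        minimum = 0
    else:
        minimum = SEVERITY_ORDER.index(fail_on)
    return any(
        level is not None and SEVERITY_ORDER.index(level) >= minimum
        for level in step_severities
    )


# A run whose worst step reached only 'warning' passes with --fail-on error.
print(should_fail(["warning", None], "error"))        # False
# A 'critical' step exceeds 'error', so the run fails.
print(should_fail(["warning", "critical"], "error"))  # True
```

Under this model, --fail-on error in the command above would leave the exit code at zero for a run that only triggers warnings.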
To scaffold a .py file for this, use pb make-template.
pb make-template: Generate a Validation Script Template
Use this command to create a starter Python script for Pointblank validation:
pb make-template my_validation.py
Edit the generated script to add your own data loading and validation rules, then run it with pb run.
Integration with CI/CD
Validation through the CLI provides opportunities for automation. The following features lend themselves well to automated processes:
- validation results are shown in a clear, color-coded table
- --show-extract or --write-extract can be used to debug failing rows
- the --exit-code or --fail-on options are ideal for CI/CD integration
Here’s an example of how one might integrate data validation through the Pointblank CLI into a CI/CD pipeline:
# Example GitHub Actions workflow
name: Data Validation
on: [push, pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'
      - name: Install dependencies
        run: pip install pointblank
      - name: Validate data quality
        run: |
          pb validate data/sales.csv --check rows-distinct --exit-code
          pb validate data/sales.csv --check col-vals-not-null --column customer_id --exit-code
          pb validate data/sales.csv --check col-vals-gt --column amount --value 0 --exit-code
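Outside of GitHub Actions, the same three checks could be driven from a small Python script that relies on the non-zero exit code produced by --exit-code. The following is a sketch under assumptions: run_all and main are hypothetical helpers, the file path and column names are taken from the workflow above, and pb is assumed to be on the PATH when the script is actually run.

```python
import subprocess
import sys

# The three checks from the workflow, one argv list per command.
BASE = ["pb", "validate", "data/sales.csv"]
COMMANDS = [
    BASE + ["--check", "rows-distinct", "--exit-code"],
    BASE + ["--check", "col-vals-not-null", "--column", "customer_id", "--exit-code"],
    BASE + ["--check", "col-vals-gt", "--column", "amount", "--value", "0", "--exit-code"],
]


def run_all(commands, runner=subprocess.run):
    """Run every command and return those that exited with a non-zero code.

    `runner` defaults to subprocess.run; it is a parameter so the logic can
    be exercised without the pb executable installed.
    """
    failed = []
    for argv in commands:
        result = runner(argv)
        if result.returncode != 0:
            failed.append(argv)
    return failed


def main(runner=subprocess.run):
    """Report failed commands on stderr and return a process exit code."""
    failures = run_all(COMMANDS, runner=runner)
    for argv in failures:
        print("FAILED:", " ".join(argv), file=sys.stderr)
    return 1 if failures else 0
```

To use it as a pipeline step, end the script with sys.exit(main()). Unlike a multi-line shell step, which typically stops at the first failing command, this version runs every check and reports all failures together.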
Some Useful Tips
- use pb validate --list-checks to see all available checks and usage examples
- use pb run for advanced validation logic or when you need to chain multiple steps
- use pb make-template to quickly scaffold new validation scripts
For more CLI usage examples and real terminal output, see the CLI Demos.