Validation Methods
Pointblank provides a comprehensive suite of validation methods to verify different aspects of your data. Each method creates a validation step that becomes part of your validation plan.
These validation methods cover everything from checking column values against thresholds to validating the table structure and detecting duplicates. Combined into validation steps, they form the foundation of your data quality workflow.
Pointblank provides over 20 validation methods to handle diverse data quality requirements. These are grouped into three main categories:
- Column Value Validations
- Row-based Validations
- Table Structure Validations
Within each of these categories, we’ll walk through several examples showing how each validation method creates steps in your validation plan.
We'll use the `small_table` dataset (a Polars table with 13 rows and 8 columns) for all of our examples. Here's a preview of it:
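One way to produce such a preview in your own session is with the `pb.preview()` function (a small sketch):

```python
import pointblank as pb

# Load the `small_table` dataset as a Polars DataFrame and render a preview
pb.preview(pb.load_dataset(dataset="small_table", tbl_type="polars"))
```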
Validation Methods to Validation Steps
In Pointblank, validation methods become validation steps when you add them to a validation plan. Each method creates a distinct step that performs a specific check on your data.
Here’s a simple example showing how three validation methods create three validation steps:
```python
import pointblank as pb

(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))

    # Step 1: Check that values in column `a` are greater than 2 ---
    .col_vals_gt(columns="a", value=2, brief="Values in 'a' must exceed 2.")

    # Step 2: Check that column 'date' exists in the table ---
    .col_exists(columns="date", brief="Column 'date' must exist.")

    # Step 3: Check that the table has exactly 13 rows ---
    .row_count_match(count=13, brief="Table should have exactly 13 rows.")

    .interrogate()
)
```
Each validation method produces one step in the validation report above. When combined, these steps form a complete validation plan that systematically checks different aspects of your data quality.
Common Arguments
Most validation methods in Pointblank share a set of common arguments that provide consistency and flexibility across different validation types:
- `columns=`: specifies which column(s) to validate (used in column-based validations)
- `pre=`: allows data transformation before validation
- `segments=`: enables validation across different data subsets
- `thresholds=`: sets acceptable failure thresholds
- `actions=`: defines actions to take when validations fail
- `brief=`: provides a description of what the validation is checking
- `active=`: determines if the validation step should be executed (default is `True`)
- `na_pass=`: controls how missing values are handled (only for column value validation methods)

For column validation methods, the `na_pass=` parameter determines whether missing values (NULL/None/NA) should pass validation (this parameter is covered in a later section).
These arguments follow a consistent pattern across validation methods, so you don’t need to memorize different parameter sets for each function. This systematic approach makes Pointblank more intuitive to work with as you build increasingly complex validation plans.
We’ll cover most of these common arguments in their own dedicated sections later in the User Guide, as some of them represent a deeper topic worthy of focused attention.
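To give a sense of how these arguments slot into a single validation step, here's a minimal sketch (the threshold levels and brief text are illustrative):

```python
import pointblank as pb

(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_vals_gt(
        columns="a",                                         # which column to validate
        value=2,                                             # the comparison value
        na_pass=True,                                        # let missing values pass
        thresholds=pb.Thresholds(warning=0.1, error=0.25),   # acceptable failure levels
        brief="Values in 'a' must exceed 2.",                # description of the step
        active=True,                                         # run this step (the default)
    )
    .interrogate()
)
```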
1. Column Value Validations
These methods check individual values within columns against specific criteria:
- Comparison checks (`col_vals_gt()`, `col_vals_lt()`, etc.) for comparing values to thresholds or other columns
- Range checks (`col_vals_between()`, `col_vals_outside()`) for verifying values fall within or outside specific ranges (see the sketch just after this list)
- Set membership checks (`col_vals_in_set()`, `col_vals_not_in_set()`) for validating values against predefined sets
- Null value checks (`col_vals_null()`, `col_vals_not_null()`) for testing presence or absence of null values
- Pattern matching checks (`col_vals_regex()`) for validating text patterns with regular expressions
- Custom expression checks (`col_vals_expr()`) for complex validations using custom expressions
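As a quick taste of the range and set membership categories, the following sketch checks that values in column `c` fall between 0 and 10 and that values in column `f` belong to a fixed set (the specific bounds and set are illustrative):

```python
import pointblank as pb

(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    # Range check: values in `c` should fall within [0, 10]
    .col_vals_between(columns="c", left=0, right=10, na_pass=True)
    # Set membership check: values in `f` should be one of three labels
    .col_vals_in_set(columns="f", set=["low", "mid", "high"])
    .interrogate()
)
```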
Now let’s look at some key examples from select categories of column value validations.
Comparison Checks
Let's start with a simple example of how `col_vals_gt()` might be used to check if the values in a column are greater than a specified value.
```python
(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_vals_gt(columns="a", value=5)
    .interrogate()
)
```
If you're checking data in a column that contains `Null`/`None`/`NA` values and you'd like to disregard those values (i.e., let them pass validation), you can use `na_pass=True`. The following example checks values in column `c` of `small_table`, which contains two `None` values:
```python
(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_vals_le(columns="c", value=10, na_pass=True)
    .interrogate()
)
```
| STEP | EVAL | UNITS | PASS | FAIL |
|------|------|-------|------|------|
| 1 `col_vals_le()` | ✓ | 13 | 13 / 1.00 | 0 / 0.00 |
In the above validation table, we see that all test units passed. If we didn't use `na_pass=True`, there would be 2 failing test units, one for each `None` value in the `c` column.
It's possible to check column values against values in an adjacent column. To do this, supply the `value=` argument with the column name wrapped in the `col()` helper function. Here's an example of that:
```python
(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_vals_lt(columns="a", value=pb.col("c"))
    .interrogate()
)
```
This validation checks that values in column `a` are less than values in column `c`.
Checking for Missing Values
A very common thing to validate is that there are no Null/NA/missing values in a column. The `col_vals_not_null()` method checks that a column contains no missing values:
```python
(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_vals_not_null(columns="a")
    .interrogate()
)
```
| STEP | EVAL | UNITS | PASS | FAIL |
|------|------|-------|------|------|
| 1 `col_vals_not_null()` | ✓ | 13 | 13 / 1.00 | 0 / 0.00 |
Column `a` has no missing values, and the above validation confirms this.
Checking Strings with Regexes
A regular expression (regex) validation via the `col_vals_regex()` validation method checks if values in a column match a specified pattern. Here's an example with two validation steps, each checking text values in a column:
```python
(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_vals_regex(columns="b", pattern=r"^\d-[a-z]{3}-\d{3}$")
    .col_vals_regex(columns="f", pattern=r"high|low|mid")
    .interrogate()
)
```
| STEP | EVAL | UNITS | PASS | FAIL |
|------|------|-------|------|------|
| 1 `col_vals_regex()` | ✓ | 13 | 13 / 1.00 | 0 / 0.00 |
| 2 `col_vals_regex()` | ✓ | 13 | 13 / 1.00 | 0 / 0.00 |
Handling Missing Values with `na_pass=`
When validating columns containing Null/None/NA values, you can control how these missing values are treated with the `na_pass=` parameter:
```python
(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_vals_le(columns="c", value=10, na_pass=True)
    .interrogate()
)
```
| STEP | EVAL | UNITS | PASS | FAIL |
|------|------|-------|------|------|
| 1 `col_vals_le()` | ✓ | 13 | 13 / 1.00 | 0 / 0.00 |
In the above example, column `c` contains two `None` values, but all test units pass because we set `na_pass=True`. Without this setting, those two values would fail the validation.

In summary, `na_pass=` works like this:

- `na_pass=True`: missing values pass validation regardless of the condition being tested
- `na_pass=False` (the default): missing values fail validation
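As a quick illustration of the default behavior, here's the same check without `na_pass=` set; given the two `None` values in column `c`, this would produce two failing test units:

```python
import pointblank as pb

(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    # With the default na_pass=False, the two None values in `c` fail this check
    .col_vals_le(columns="c", value=10)
    .interrogate()
)
```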
2. Row-based Validations
Row-based validations focus on examining properties that span across entire rows rather than individual columns. These are essential for detecting issues that can’t be found by looking at columns in isolation:
- `rows_distinct()`: ensures no duplicate rows exist in the table
- `rows_complete()`: verifies that no rows contain any missing values
These row-level validations are particularly valuable for ensuring data integrity and completeness at the record level, which is crucial for many analytical and operational data applications.
Checking Row Distinctness
Here's an example where we check for duplicate rows with `rows_distinct()`:
```python
(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .rows_distinct()
    .interrogate()
)
```
We can also adapt the `rows_distinct()` check to use a single column or a subset of columns. To do that, we need to use the `columns_subset=` parameter. Here's an example of that:
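A minimal sketch of that narrowed check (the particular columns chosen here are illustrative):

```python
import pointblank as pb

(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    # Only consider columns `a` and `b` when looking for duplicate rows
    .rows_distinct(columns_subset=["a", "b"])
    .interrogate()
)
```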
Checking Row Completeness
Another important validation is checking for complete rows: rows that have no missing values across all columns or a specified subset of columns. The `rows_complete()` validation method performs this check.
Here’s an example checking if all rows in the table are complete (have no missing values in any column):
```python
(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .rows_complete()
    .interrogate()
)
```
As the report indicates, there are some incomplete rows in the table.
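As with `rows_distinct()`, the completeness check can be narrowed to a subset of columns; here's a sketch of that, assuming the same `columns_subset=` parameter and an illustrative pair of columns:

```python
import pointblank as pb

(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    # Require completeness only in columns `a` and `b`
    .rows_complete(columns_subset=["a", "b"])
    .interrogate()
)
```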
3. Table Structure Validations
Table structure validations ensure that the overall architecture of your data meets expectations. These structural checks form a foundation for more detailed data quality assessments:
- `col_exists()`: verifies a column exists in the table
- `col_schema_match()`: ensures the table matches a defined schema
- `col_count_match()`: confirms the table has the expected number of columns
- `row_count_match()`: verifies the table has the expected number of rows
These structural validations provide essential checks on the fundamental organization of your data tables, ensuring they have the expected dimensions and components needed for reliable data analysis.
Checking Column Presence
If you need to check for the presence of individual columns, the `col_exists()` validation method is useful. In this example, we check whether the `date` column is present in the table:
```python
(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_exists(columns="date")
    .interrogate()
)
```
| STEP | EVAL | UNITS | PASS | FAIL |
|------|------|-------|------|------|
| 1 `col_exists()` | ✓ | 1 | 1 / 1.00 | 0 / 0.00 |
That column is present, so the single test unit of this validation step is a passing one.
Checking the Table Schema
For deeper checks of table structure, a schema validation can be performed with the `col_schema_match()` validation method, where the goal is to check whether the structure of a table matches an expected schema. To define an expected table schema, we need to use the `Schema` class. Here is a simple example that (1) prepares a schema consisting of column names and (2) uses that `schema` object in a `col_schema_match()` validation step:
```python
schema = pb.Schema(columns=["date_time", "date", "a", "b", "c", "d", "e", "f"])

(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_schema_match(schema=schema)
    .interrogate()
)
```
| STEP | EVAL | UNITS | PASS | FAIL |
|------|------|-------|------|------|
| 1 `col_schema_match()` | ✓ | 1 | 1 / 1.00 | 0 / 0.00 |
The `col_schema_match()` validation step will only have a single test unit (signifying pass or fail). We can see in the above validation report that the column schema validation passed.
More often, a schema will be defined using column names and column types. We can do that by using a list of tuples in the `columns=` parameter of `Schema`. Here's an example of that approach in action:
```python
schema = pb.Schema(
    columns=[
        ("date_time", "Datetime(time_unit='us', time_zone=None)"),
        ("date", "Date"),
        ("a", "Int64"),
        ("b", "String"),
        ("c", "Int64"),
        ("d", "Float64"),
        ("e", "Boolean"),
        ("f", "String"),
    ]
)

(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_schema_match(schema=schema)
    .interrogate()
)
```
| STEP | EVAL | UNITS | PASS | FAIL |
|------|------|-------|------|------|
| 1 `col_schema_match()` | ✓ | 1 | 1 / 1.00 | 0 / 0.00 |
The `col_schema_match()` validation method has several boolean parameters for making the checks less stringent:

- `complete=`: requires exact column matching (all expected columns must exist, no extra columns allowed)
- `in_order=`: enforces that columns appear in the same order as defined in the schema
- `case_sensitive_colnames=`: column names must match with exact letter case
- `case_sensitive_dtypes=`: data type strings must match with exact letter case

These parameters all default to `True`, providing strict schema validation. Setting any to `False` relaxes the validation requirements, making the checks more flexible when exact matching isn't necessary or practical for your use case.
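For instance, if column order shouldn't matter and an exact column-for-column match isn't required, the check might be relaxed like this (a sketch reusing the `schema` object defined above):

```python
(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    # Relax the exact-match and column-order requirements; name and dtype
    # comparisons remain case-sensitive (the defaults)
    .col_schema_match(schema=schema, complete=False, in_order=False)
    .interrogate()
)
```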
Checking Counts of Rows and Columns
Row and column count validations check the number of rows and columns in a table.
The `row_count_match()` validation method checks whether the number of rows in a table matches a specified count.
```python
(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .row_count_match(count=13)
    .interrogate()
)
```
| STEP | EVAL | UNITS | PASS | FAIL |
|------|------|-------|------|------|
| 1 `row_count_match()` | ✓ | 1 | 1 / 1.00 | 0 / 0.00 |
The `col_count_match()` validation method checks if the number of columns in a table matches a specified count.
```python
(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    .col_count_match(count=8)
    .interrogate()
)
```
| STEP | EVAL | UNITS | PASS | FAIL |
|------|------|-------|------|------|
| 1 `col_count_match()` | ✓ | 1 | 1 / 1.00 | 0 / 0.00 |
Expectations on column and row counts can be useful in certain situations and they align nicely with schema checks.
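For instance, dimension checks and a schema check can sit together in one plan (a sketch that reuses the `schema` object defined earlier):

```python
(
    pb.Validate(data=pb.load_dataset(dataset="small_table", tbl_type="polars"))
    # Structural checks: schema, row count, and column count in one plan
    .col_schema_match(schema=schema)
    .row_count_match(count=13)
    .col_count_match(count=8)
    .interrogate()
)
```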
Conclusion
In this article, we’ve explored the various types of validation methods that Pointblank offers for ensuring data quality. These methods provide a framework for validating column values, checking row properties, and verifying table structures. By combining these validation methods into comprehensive plans, you can systematically test your data against business rules and quality expectations. And this all helps to ensure your data remains reliable and trustworthy.