Validation Types

The collection of validation methods in Pointblank allows you to express all sorts of checks on your DataFrames and database tables. We’ll use the small_table dataset for all of the examples shown here. Here’s a preview of it:

Polars table: 13 rows, 8 columns

      date_time            date        a        b          c        d         e         f
      (Datetime)           (Date)      (Int64)  (String)   (Int64)  (Float64) (Boolean) (String)
  1   2016-01-04 11:00:00  2016-01-04  2        1-bcd-345  3        3423.29   True      high
  2   2016-01-04 00:32:00  2016-01-04  3        5-egh-163  8        9999.99   True      low
  3   2016-01-05 13:32:00  2016-01-05  6        8-kdg-938  3        2343.23   True      high
  4   2016-01-06 17:23:00  2016-01-06  2        5-jdo-903  None     3892.4    False     mid
  5   2016-01-09 12:36:00  2016-01-09  8        3-ldm-038  7        283.94    True      low
  6   2016-01-11 06:15:00  2016-01-11  4        2-dhe-923  4        3291.03   True      mid
  7   2016-01-15 18:46:00  2016-01-15  7        1-knw-093  3        843.34    True      high
  8   2016-01-17 11:27:00  2016-01-17  4        5-boe-639  2        1035.64   False     low
  9   2016-01-20 04:30:00  2016-01-20  3        5-bce-642  9        837.93    False     high
 10   2016-01-20 04:30:00  2016-01-20  3        5-bce-642  9        837.93    False     high
 11   2016-01-26 20:07:00  2016-01-26  4        2-dmx-010  7        833.98    True      low
 12   2016-01-28 02:51:00  2016-01-28  2        7-dmx-010  8        108.34    False     low
 13   2016-01-30 11:23:00  2016-01-30  1        3-dka-303  None     2230.09   True      high
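
If you want to follow along, the dataset can be loaded with load_dataset(). The sketch below also uses the preview() helper (assumed here) to display a sample of rows along with column types:

import pointblank as pb

# Load the built-in `small_table` dataset (a Polars DataFrame, as shown above)
small_table = pb.load_dataset("small_table")

# Show a preview of the table (pb.preview() is assumed to render a row sample)
pb.preview(small_table)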

Column Value Comparison Checks

Column value comparison checks work by comparing values in a column against a fixed value or against values in a different column. The validation methods of this type are col_vals_gt(), col_vals_ge(), col_vals_lt(), col_vals_le(), col_vals_eq(), and col_vals_ne(), each named after the comparison it performs.

These are typically numerical checks, so the column supplied in columns= will usually be numeric, as will the value or column referenced in value=.

Here’s an example using col_vals_gt(), which checks if the values in a column are greater than a specified value.

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_gt(columns="a", value=5)
    .interrogate()
)
STEP  VALIDATION       COLUMNS  VALUES  UNITS  PASS       FAIL
1     col_vals_gt()    a        5       13     3 (0.23)   10 (0.77)
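
The other comparison methods follow the same pattern. As a quick sketch (assuming col_vals_eq() and col_vals_ne() behave like their counterparts above), here's how equality and inequality checks on column a might look:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_eq(columns="a", value=3)  # values in `a` equal to 3
    .col_vals_ne(columns="a", value=7)  # values in `a` not equal to 7
    .interrogate()
)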

If you’re checking data in a column that contains null/None/NA values and you’d like to disregard those values (i.e., let them pass validation), you can use na_pass=True. The following example checks values in column c of small_table, which contains two None values:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_le(columns="c", value=10, na_pass=True)
    .interrogate()
)
STEP  VALIDATION       COLUMNS  VALUES  UNITS  PASS        FAIL
1     col_vals_le()    c        10      13     13 (1.00)   0 (0.00)

In the above validation table, we see that all test units passed. If we hadn’t used na_pass=True, there would be 2 failing test units, one for each None value in the c column.
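
For comparison, here's a sketch of the same step with the default behavior (na_pass=False), which would produce those two failing test units:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_le(columns="c", value=10)  # None values in `c` count as failures by default
    .interrogate()
)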

It’s possible to check column values against values in an adjacent column. To do this, supply the column name to the value= argument, wrapped in the col() helper function. Here’s an example of that:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_lt(columns="a", value=pb.col("c"))
    .interrogate()
)
STEP  VALIDATION       COLUMNS  VALUES  UNITS  PASS       FAIL
1     col_vals_lt()    a        c       13     6 (0.46)   7 (0.54)

This validation checks that values in column a are less than values in column c.

Column Value Range Checks

A range check determines whether values in a column are within a fixed range or a range that depends on values in adjacent columns. The col_vals_between() validation method makes this type of check possible. Here is an example of its use:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_between(columns="a", left=2, right=10)
    .interrogate()
)
STEP  VALIDATION            COLUMNS  VALUES   UNITS  PASS        FAIL
1     col_vals_between()    a        [2, 10]  13     12 (0.92)   1 (0.08)

Note that in the validation report, we can see [2, 10] in the VALUES column. This notation with square brackets indicates that the bounds of the range are inclusive. If this is not what you want, the inclusivity of both bounds can be modified via the inclusive= parameter.
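
For example, a sketch of a left-inclusive, right-exclusive version of the check above might look like this (assuming inclusive= accepts a pair of booleans for the left and right bounds):

(
    pb.Validate(data=pb.load_dataset("small_table"))
    # Assumed form: (left_inclusive, right_inclusive)
    .col_vals_between(columns="a", left=2, right=10, inclusive=(True, False))
    .interrogate()
)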

The left= and right= arguments accept either fixed values or references to other columns in the table via the col() helper function. The following example illustrates three variations of this:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_between(columns="a", left=pb.col("c"), right=10)
    .col_vals_between(columns="a", left=2, right=pb.col("d"))
    .col_vals_between(columns="a", left=pb.col("c"), right=pb.col("d"))
    .interrogate()
)
STEP  VALIDATION            COLUMNS  VALUES   UNITS  PASS        FAIL
1     col_vals_between()    a        [c, 10]  13     5 (0.38)    8 (0.62)
2     col_vals_between()    a        [2, d]   13     12 (0.92)   1 (0.08)
3     col_vals_between()    a        [c, d]   13     5 (0.38)    8 (0.62)

There is a related validation method that checks whether column values lie outside a range. That one is called col_vals_outside():

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_outside(columns="a", left=1, right=4)
    .interrogate()
)
STEP  VALIDATION            COLUMNS  VALUES  UNITS  PASS       FAIL
1     col_vals_outside()    a        [1, 4]  13     3 (0.23)   10 (0.77)

The col_vals_outside() validation method has the same set of parameters as col_vals_between(); the major difference is that any values found within the bounds are considered failing test units.
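
That shared interface means the col() helper and na_pass= can be used here as well; here's a sketch:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    # Fail any value of `a` that falls between the values of `c` and `d` in the same row
    .col_vals_outside(columns="a", left=pb.col("c"), right=pb.col("d"), na_pass=True)
    .interrogate()
)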

Set Membership Validations

The set membership validation method col_vals_in_set() checks whether values in a column are part of a specified set. Here’s an example where the expectation is that values in column a can only be one of the numeric values in the set [1, 2, 3, 4, 5]:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_in_set(columns="a", set=[1, 2, 3, 4, 5])
    .interrogate()
)
STEP  VALIDATION           COLUMNS  VALUES           UNITS  PASS        FAIL
1     col_vals_in_set()    a        1, 2, 3, 4, 5    13     10 (0.77)   3 (0.23)

The above example operates on a numeric column. We can also perform a set membership check on a string-based column. Here’s an example that checks string values in the f column of small_table:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_in_set(columns="f", set=["low", "mid", "high"])
    .interrogate()
)
STEP  VALIDATION           COLUMNS  VALUES           UNITS  PASS        FAIL
1     col_vals_in_set()    f        low, mid, high   13     13 (1.00)   0 (0.00)

The col_vals_not_in_set() validation method does what you might expect: it checks whether values in a column are not part of a fixed set. Here’s an example of its use:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_in_set(columns="f", set=["small", "large"])
    .interrogate()
)
STEP  VALIDATION               COLUMNS  VALUES         UNITS  PASS        FAIL
1     col_vals_not_in_set()    f        small, large   13     13 (1.00)   0 (0.00)

This validation passes completely because no values in f are "small" or "large".

Checks for Missing Values

Missing value validations check for the presence or absence of null/None/NA values in a column. The two validation methods of this type are col_vals_null() and col_vals_not_null().

Let’s look at an example that checks that there aren’t any missing values in column a:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_not_null(columns="a")
    .interrogate()
)
STEP  VALIDATION             COLUMNS  VALUES  UNITS  PASS        FAIL
1     col_vals_not_null()    a        -       13     13 (1.00)   0 (0.00)
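
The complementary method, col_vals_null(), checks that values in a column are missing. As a sketch, it could be pointed at column c, which we saw earlier contains two None values:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_null(columns="c")  # only the None values in `c` would pass this check
    .interrogate()
)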

Validation of Table Structure

Schema validation can be performed with the col_schema_match() validation method; the goal is to check whether the structure of a table matches an expected schema. To define an expected table schema, we use the Schema class. Here is a simple example that (1) prepares a schema consisting only of column names, and (2) uses that schema object in a col_schema_match() validation step:

schema = pb.Schema(columns=["date_time", "date", "a", "b", "c", "d", "e", "f"])

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_schema_match(schema=schema)
    .interrogate()
)
STEP  VALIDATION            COLUMNS  VALUES  UNITS  PASS       FAIL
1     col_schema_match()    -        SCHEMA  1      1 (1.00)   0 (0.00)

Any col_schema_match() validation step will only have a single test unit (signifying pass or fail). We can see in the above validation report that the column schema validation passed.

More often, a schema will be defined using both column names and column types. We can do that by supplying a list of tuples to the columns= parameter of Schema. Here’s an example of that approach in action:

schema = pb.Schema(
    columns=[
        ("date_time", "Datetime(time_unit='us', time_zone=None)"),
        ("date", "Date"),
        ("a", "Int64"),
        ("b", "String"),
        ("c", "Int64"),
        ("d", "Float64"),
        ("e", "Boolean"),
        ("f", "String"),
    ]
)

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_schema_match(schema=schema)
    .interrogate()
)
STEP  VALIDATION            COLUMNS  VALUES  UNITS  PASS       FAIL
1     col_schema_match()    -        SCHEMA  1      1 (1.00)   0 (0.00)

The col_schema_match() validation method has several boolean parameters for making the checks less stringent:

  • complete=
  • in_order=
  • case_sensitive_colnames=
  • case_sensitive_dtypes=

All of these are True by default, but modifying these settings can make for more convenient checks when absolute strictness isn’t practical.
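
For instance, here's a sketch of a more lenient check that pairs a hypothetical partial schema with relaxed completeness and ordering requirements:

# A hypothetical partial schema that describes only two of the columns
schema_partial = pb.Schema(columns=[("date", "Date"), ("a", "Int64")])

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_schema_match(
        schema=schema_partial,
        complete=False,  # the schema doesn't need to describe every column
        in_order=False,  # column order doesn't need to match
    )
    .interrogate()
)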

Should you need to check for the presence of individual columns (regardless of type), the col_exists() validation method is useful. In this example, we check whether the date column is present in the table:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_exists(columns="date")
    .interrogate()
)
STEP  VALIDATION      COLUMNS  VALUES  UNITS  PASS       FAIL
1     col_exists()    date     -       1      1 (1.00)   0 (0.00)

That column is present, so the single test unit of this validation step is a passing one.

Row and Column Count Validations

Row and column count validations check the number of rows and columns in a table.

The row_count_match() validation method checks whether the number of rows in a table matches a specified count.

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .row_count_match(count=13)
    .interrogate()
)
STEP  VALIDATION           COLUMNS  VALUES  UNITS  PASS       FAIL
1     row_count_match()    -        13      1      1 (1.00)   0 (0.00)

The col_count_match() validation method checks if the number of columns in a table matches a specified count.

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_count_match(count=8)
    .interrogate()
)
STEP  VALIDATION           COLUMNS  VALUES  UNITS  PASS       FAIL
1     col_count_match()    -        8       1      1 (1.00)   0 (0.00)

Distinct Row Validations

We can check for duplicate rows in a table with rows_distinct().

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .rows_distinct()
    .interrogate()
)
STEP  VALIDATION         COLUMNS      VALUES  UNITS  PASS        FAIL
1     rows_distinct()    ALL COLUMNS  -       13     11 (0.85)   2 (0.15)

We can also adapt the rows_distinct() check to use a single column or a subset of columns. To do that, we need to use the columns_subset= parameter. Here’s an example of that:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .rows_distinct(columns_subset="b")
    .interrogate()
)
STEP  VALIDATION         COLUMNS  VALUES  UNITS  PASS        FAIL
1     rows_distinct()    b        -       13     11 (0.85)   2 (0.15)

Checking Strings in a Column with a Regular Expression

A regular expression (regex) validation via the col_vals_regex() validation method checks if values in a column match a specified pattern. Here’s an example with two validation steps:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_regex(columns="b", pattern=r"^\d-[a-z]{3}-\d{3}$")
    .col_vals_regex(columns="f", pattern=r"high|low|mid")
    .interrogate()
)
STEP  VALIDATION          COLUMNS  VALUES               UNITS  PASS        FAIL
1     col_vals_regex()    b        ^\d-[a-z]{3}-\d{3}$  13     13 (1.00)   0 (0.00)
2     col_vals_regex()    f        high|low|mid         13     13 (1.00)   0 (0.00)

Both of these regular expression checks resulted in complete matches down columns b and f.

More Information

These are just a few examples of the many validation methods available in Pointblank. For more detailed information, check out the individual reference pages in the API Reference.