Validation Types

The collection of validation methods in Pointblank allows you to express all sorts of checks on your DataFrames and database tables. We’ll use the small_table dataset for all of the examples shown here. Here’s a preview of it:

Polars table: 13 rows, 8 columns

      date_time            date        a        b          c        d         e         f
      (Datetime)           (Date)      (Int64)  (String)   (Int64)  (Float64) (Boolean) (String)
  1   2016-01-04 11:00:00  2016-01-04  2        1-bcd-345  3        3423.29   True      high
  2   2016-01-04 00:32:00  2016-01-04  3        5-egh-163  8        9999.99   True      low
  3   2016-01-05 13:32:00  2016-01-05  6        8-kdg-938  3        2343.23   True      high
  4   2016-01-06 17:23:00  2016-01-06  2        5-jdo-903  None     3892.4    False     mid
  5   2016-01-09 12:36:00  2016-01-09  8        3-ldm-038  7        283.94    True      low
  6   2016-01-11 06:15:00  2016-01-11  4        2-dhe-923  4        3291.03   True      mid
  7   2016-01-15 18:46:00  2016-01-15  7        1-knw-093  3        843.34    True      high
  8   2016-01-17 11:27:00  2016-01-17  4        5-boe-639  2        1035.64   False     low
  9   2016-01-20 04:30:00  2016-01-20  3        5-bce-642  9        837.93    False     high
 10   2016-01-20 04:30:00  2016-01-20  3        5-bce-642  9        837.93    False     high
 11   2016-01-26 20:07:00  2016-01-26  4        2-dmx-010  7        833.98    True      low
 12   2016-01-28 02:51:00  2016-01-28  2        7-dmx-010  8        108.34    False     low
 13   2016-01-30 11:23:00  2016-01-30  1        3-dka-303  None     2230.09   True      high
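
If you want to follow along, the dataset can be loaded with load_dataset(). The sketch below also uses the preview() helper (assumed here) to display a sample of rows along with column types:

import pointblank as pb

# Load the built-in `small_table` dataset (a Polars DataFrame, as shown above)
small_table = pb.load_dataset("small_table")

# Show a preview of the table (pb.preview() is assumed to render a row sample)
pb.preview(small_table)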

Column Value Comparison Checks

Column value comparison checks work by comparing values in a column against a fixed value or against values in a different column. The validation methods of this type are col_vals_gt(), col_vals_ge(), col_vals_lt(), col_vals_le(), col_vals_eq(), and col_vals_ne(), each named after the comparison it performs.

These are typically numerical checks, so the column supplied in columns= will usually be numeric, as will the value or column referenced in value=.

Here’s an example using col_vals_gt(), which checks if the values in a column are greater than a specified value.

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_gt(columns="a", value=5)
    .interrogate()
)
STEP  VALIDATION       COLUMNS  VALUES  UNITS  PASS       FAIL
1     col_vals_gt()    a        5       13     3 (0.23)   10 (0.77)
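
The other comparison methods follow the same pattern. As a quick sketch (assuming col_vals_eq() and col_vals_ne() behave like their counterparts above), here's how equality and inequality checks on column a might look:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_eq(columns="a", value=3)  # values in `a` equal to 3
    .col_vals_ne(columns="a", value=7)  # values in `a` not equal to 7
    .interrogate()
)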

If you’re checking data in a column that contains null/None/NA values and you’d like to disregard those values (i.e., let them pass validation), you can use na_pass=True. The following example checks values in column c of small_table, which contains two None values:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_le(columns="c", value=10, na_pass=True)
    .interrogate()
)
STEP  VALIDATION       COLUMNS  VALUES  UNITS  PASS        FAIL
1     col_vals_le()    c        10      13     13 (1.00)   0 (0.00)

In the above validation table, we see that all test units passed. If we hadn’t used na_pass=True, there would be 2 failing test units, one for each None value in the c column.
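
For comparison, here's a sketch of the same step with the default behavior (na_pass=False), which would produce those two failing test units:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_le(columns="c", value=10)  # None values in `c` count as failures by default
    .interrogate()
)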

It’s possible to check column values against values in an adjacent column. To do this, supply the column name to the value= argument, wrapped in the col() helper function. Here’s an example of that:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_lt(columns="a", value=pb.col("c"))
    .interrogate()
)
STEP  VALIDATION       COLUMNS  VALUES  UNITS  PASS       FAIL
1     col_vals_lt()    a        c       13     6 (0.46)   7 (0.54)

This validation checks that values in column a are less than values in column c.

Column Value Range Checks

A range check determines whether values in a column are within a fixed range or a range that depends on values in adjacent columns. The col_vals_between() validation method makes this type of check possible. Here is an example of its use:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_between(columns="a", left=2, right=10)
    .interrogate()
)
STEP  VALIDATION            COLUMNS  VALUES   UNITS  PASS        FAIL
1     col_vals_between()    a        [2, 10]  13     12 (0.92)   1 (0.08)

Note that in the validation report, we can see [2, 10] in the VALUES column. This notation with square brackets indicates that the bounds of the range are inclusive. If this is not what you want, the inclusivity of both bounds can be modified via the inclusive= parameter.
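
For example, a sketch of a left-inclusive, right-exclusive version of the check above might look like this (assuming inclusive= accepts a pair of booleans for the left and right bounds):

(
    pb.Validate(data=pb.load_dataset("small_table"))
    # Assumed form: (left_inclusive, right_inclusive)
    .col_vals_between(columns="a", left=2, right=10, inclusive=(True, False))
    .interrogate()
)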

The left= and right= arguments accept either fixed values or references to other columns in the table via the col() helper function. The following example illustrates three variations of this:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_between(columns="a", left=pb.col("c"), right=10)
    .col_vals_between(columns="a", left=2, right=pb.col("d"))
    .col_vals_between(columns="a", left=pb.col("c"), right=pb.col("d"))
    .interrogate()
)
STEP  VALIDATION            COLUMNS  VALUES   UNITS  PASS        FAIL
1     col_vals_between()    a        [c, 10]  13     5 (0.38)    8 (0.62)
2     col_vals_between()    a        [2, d]   13     12 (0.92)   1 (0.08)
3     col_vals_between()    a        [c, d]   13     5 (0.38)    8 (0.62)

There is a related validation method that checks whether column values lie outside a range. That one is called col_vals_outside():

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_outside(columns="a", left=1, right=4)
    .interrogate()
)
STEP  VALIDATION            COLUMNS  VALUES  UNITS  PASS       FAIL
1     col_vals_outside()    a        [1, 4]  13     3 (0.23)   10 (0.77)

The col_vals_outside() validation method has the same set of parameters as col_vals_between(); the major difference is that any values found within the bounds are considered failing test units.
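
That shared interface means the col() helper and na_pass= can be used here as well; here's a sketch:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    # Fail any value of `a` that falls between the values of `c` and `d` in the same row
    .col_vals_outside(columns="a", left=pb.col("c"), right=pb.col("d"), na_pass=True)
    .interrogate()
)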

Set Membership Validations

The set membership validation method col_vals_in_set() checks whether values in a column are part of a specified set. Here’s an example where the expectation is that values in column a can only be one of the numeric values in the set [1, 2, 3, 4, 5]:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_in_set(columns="a", set=[1, 2, 3, 4, 5])
    .interrogate()
)
STEP  VALIDATION           COLUMNS  VALUES           UNITS  PASS        FAIL
1     col_vals_in_set()    a        1, 2, 3, 4, 5    13     10 (0.77)   3 (0.23)

The above example operates on a numeric column. We can also perform a set membership check on a string-based column. Here’s an example that checks string values in the f column of small_table:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_in_set(columns="f", set=["low", "mid", "high"])
    .interrogate()
)
STEP  VALIDATION           COLUMNS  VALUES           UNITS  PASS        FAIL
1     col_vals_in_set()    f        low, mid, high   13     13 (1.00)   0 (0.00)

The col_vals_not_in_set() validation method does what you might expect: it checks whether values in a column are not part of a fixed set. Here’s an example of its use:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_in_set(columns="f", set=["small", "large"])
    .interrogate()
)
STEP  VALIDATION               COLUMNS  VALUES         UNITS  PASS        FAIL
1     col_vals_not_in_set()    f        small, large   13     13 (1.00)   0 (0.00)

This validation passes completely because no values in f are "small" or "large".

Checks for Missing Values

Missing value validations check for the presence or absence of null/None/NA values in a column. The two validation methods of this type are col_vals_null() and col_vals_not_null().

Let’s look at an example that checks that there aren’t any missing values in column a:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_not_null(columns="a")
    .interrogate()
)
STEP  VALIDATION             COLUMNS  VALUES  UNITS  PASS        FAIL
1     col_vals_not_null()    a        -       13     13 (1.00)   0 (0.00)
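
The complementary method, col_vals_null(), checks that values in a column are missing. As a sketch, it could be pointed at column c, which we saw earlier contains two None values:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_null(columns="c")  # only the None values in `c` would pass this check
    .interrogate()
)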

Validation of Table Structure

Schema validation can be performed with the col_schema_match() validation method; the goal is to check whether the structure of a table matches an expected schema. To define an expected table schema, we use the Schema class. Here is a simple example that (1) prepares a schema consisting only of column names, and (2) uses that schema object in a col_schema_match() validation step:

schema = pb.Schema(columns=["date_time", "date", "a", "b", "c", "d", "e", "f"])

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_schema_match(schema=schema)
    .interrogate()
)
STEP  VALIDATION            COLUMNS  VALUES  UNITS  PASS       FAIL
1     col_schema_match()    -        SCHEMA  1      1 (1.00)   0 (0.00)

Any col_schema_match() validation step will only have a single test unit (signifying pass or fail). We can see in the above validation report that the column schema validation passed.

More often, a schema will be defined using both column names and column types. We can do that by supplying a list of tuples to the columns= parameter of Schema. Here’s an example of that approach in action:

schema = pb.Schema(
    columns=[
        ("date_time", "Datetime(time_unit='us', time_zone=None)"),
        ("date", "Date"),
        ("a", "Int64"),
        ("b", "String"),
        ("c", "Int64"),
        ("d", "Float64"),
        ("e", "Boolean"),
        ("f", "String"),
    ]
)

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_schema_match(schema=schema)
    .interrogate()
)
STEP  VALIDATION            COLUMNS  VALUES  UNITS  PASS       FAIL
1     col_schema_match()    -        SCHEMA  1      1 (1.00)   0 (0.00)

The col_schema_match() validation method has several boolean parameters for making the checks less stringent:

  • complete=
  • in_order=
  • case_sensitive_colnames=
  • case_sensitive_dtypes=

All of these are True by default, but modifying these settings can make for more convenient checks when absolute strictness isn’t practical.
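
For instance, here's a sketch of a more lenient check that pairs a hypothetical partial schema with relaxed completeness and ordering requirements:

# A hypothetical partial schema that describes only two of the columns
schema_partial = pb.Schema(columns=[("date", "Date"), ("a", "Int64")])

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_schema_match(
        schema=schema_partial,
        complete=False,  # the schema doesn't need to describe every column
        in_order=False,  # column order doesn't need to match
    )
    .interrogate()
)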

Should you need to check for the presence of individual columns (regardless of type), the col_exists() validation method is useful. In this example, we check whether the date column is present in the table:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_exists(columns="date")
    .interrogate()
)
STEP  VALIDATION      COLUMNS  VALUES  UNITS  PASS       FAIL
1     col_exists()    date     -       1      1 (1.00)   0 (0.00)

That column is present, so the single test unit of this validation step is a passing one.

Row and Column Count Validations

Row and column count validations check the number of rows and columns in a table.

The row_count_match() validation method checks whether the number of rows in a table matches a specified count.

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .row_count_match(count=13)
    .interrogate()
)
STEP  VALIDATION           COLUMNS  VALUES  UNITS  PASS       FAIL
1     row_count_match()    -        13      1      1 (1.00)   0 (0.00)

The col_count_match() validation method checks if the number of columns in a table matches a specified count.

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_count_match(count=8)
    .interrogate()
)
STEP  VALIDATION           COLUMNS  VALUES  UNITS  PASS       FAIL
1     col_count_match()    -        8       1      1 (1.00)   0 (0.00)

Distinct Row Validations

We can check for duplicate rows in a table with rows_distinct().

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .rows_distinct()
    .interrogate()
)
STEP  VALIDATION         COLUMNS      VALUES  UNITS  PASS        FAIL
1     rows_distinct()    ALL COLUMNS  -       13     11 (0.85)   2 (0.15)

We can also adapt the rows_distinct() check to use a single column or a subset of columns. To do that, we need to use the columns_subset= parameter. Here’s an example of that:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .rows_distinct(columns_subset="b")
    .interrogate()
)
STEP  VALIDATION         COLUMNS  VALUES  UNITS  PASS        FAIL
1     rows_distinct()    b        -       13     11 (0.85)   2 (0.15)

Checking Strings in a Column with a Regular Expression

A regular expression (regex) validation via the col_vals_regex() validation method checks if values in a column match a specified pattern. Here’s an example with two validation steps:

(
    pb.Validate(data=pb.load_dataset("small_table"))
    .col_vals_regex(columns="b", pattern=r"^\d-[a-z]{3}-\d{3}$")
    .col_vals_regex(columns="f", pattern=r"high|low|mid")
    .interrogate()
)
STEP  VALIDATION          COLUMNS  VALUES               UNITS  PASS        FAIL
1     col_vals_regex()    b        ^\d-[a-z]{3}-\d{3}$  13     13 (1.00)   0 (0.00)
2     col_vals_regex()    f        high|low|mid         13     13 (1.00)   0 (0.00)

Both of these regular expression checks resulted in complete matches down columns b and f.

More Information

These are just a few examples of the many validation methods available in Pointblank. For more detailed information, check out the individual reference pages in the API Reference.