PolarsRows13Columns8 |
||||||||
Validation Types
The collection of validation methods in Pointblank allows you to express all sorts of checks on your DataFrames and database tables. We’ll use the small_table
dataset for all of the examples shown here. Here’s a preview of it:
Column Value Comparison Checks
Column value comparison checks work by comparing values in a column against a fixed value, or, values in a different column. Here are all the validation methods of this type and comparison checks they make:
col_vals_gt()
–>
col_vals_lt()
–<
col_vals_ge()
–>=
col_vals_le()
–<=
col_vals_eq()
–==
col_vals_ne()
–!=
These types of checks are typically numerical checks, so the column supplied in columns=
will be numeric along with the value or column referenced in value=
.
Here’s an example using col_vals_gt()
, which checks if the values in a column are greater than a specified value.
(=pb.load_dataset("small_table"))
pb.Validate(data="a", value=5)
.col_vals_gt(columns
.interrogate() )
If you’re checking data in a column that contains null/None
/NA
values and you’d like to disregard those values (i.e., let them pass validation), you can use na_pass=True
. The following example checks values in column c
of small_table
, which contains two None
values:
(=pb.load_dataset("small_table"))
pb.Validate(data="c", value=10, na_pass=True)
.col_vals_le(columns
.interrogate() )
STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | E | C | EXT | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
#4CA64C | 1 |
|
✓ | 13 | 13 1.00 |
0 0.00 |
— | — | — | — |
In the above validation table, we see that all test units passed. If we didn’t use na_pass=True
there would be 2 failing test units, one for each None
value in the c
column.
It’s possible to check against column values against values in an adjacent column. To do this, supply the value=
argument with the column name within the col()
helper function. Here’s an example of that:
(=pb.load_dataset("small_table"))
pb.Validate(data="a", value=pb.col("c"))
.col_vals_lt(columns
.interrogate() )
This validation checks that values in column a
are less than values in column c
.
Column Value Range Checks
A range check determines whether values in a column are within a fixed range or a range that depends on values in adjacent columns. The col_vals_between()
validation method makes this type of check possible. Here is an example of its use:
(=pb.load_dataset("small_table"))
pb.Validate(data="a", left=2, right=10)
.col_vals_between(columns
.interrogate() )
Note that in the validation report, we can see [2, 10]
in the VALUES
column. This notation with square brackets indicates that the bounds of the range are inclusive. If this is not what you want, the inclusivity of both bounds can be modified via the inclusive=
parameter.
The left=
and right=
arguments can accept fixed values or references to other columns in the table via the col()
helper function. In this example, we can illustrate three variations of this:
(=pb.load_dataset("small_table"))
pb.Validate(data="a", left=pb.col("c"), right=10)
.col_vals_between(columns="a", left=2, right=pb.col("d"))
.col_vals_between(columns="a", left=pb.col("c"), right=pb.col("d"))
.col_vals_between(columns
.interrogate() )
There is a related validation method that checks whether column values lie outside a range. That one is called col_vals_outside()
:
(=pb.load_dataset("small_table"))
pb.Validate(data="a", left=1, right=4)
.col_vals_outside(columns
.interrogate() )
The col_vals_outside()
validation method has the same set of parameters as col_vals_between()
, the major difference with the former is that any values found within the bounds will be considered as failing test units.
Set Membership Validations
The set membership validation method col_vals_in_set()
checks whether values in a column are part of a specified set. Here’s an example where the expectation is that values in column a
can only have numeric values in the set of [1, 2, 3, 4, 5]
:
(=pb.load_dataset("small_table"))
pb.Validate(data="a", set=[1, 2, 3, 4, 5])
.col_vals_in_set(columns
.interrogate() )
The above example operates on a numeric column. We can also perform a set membership check on a string-based column. Here’s an example that checks string values in the f
column of small_table
:
(=pb.load_dataset("small_table"))
pb.Validate(data="f", set=["low", "mid", "high"])
.col_vals_in_set(columns
.interrogate() )
STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | E | C | EXT | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
#4CA64C | 1 |
|
✓ | 13 | 13 1.00 |
0 0.00 |
— | — | — | — |
The col_vals_not_in_set()
validation method does what you might expect, which is checking whether values in a column are not part of a fixed set. Here’s an example of its use:
(=pb.load_dataset("small_table"))
pb.Validate(data="f", set=["small", "large"])
.col_vals_in_set(columns
.interrogate() )
This validation passes completely because no values in f
are "small"
or "large"
.
Checks for Missing Values
Missing values validations check for the presence or absence of null/None
/NA
values in a column. The two validation methods of this type are:
Let’s look at an example that checks that there should not be any missing values:
(=pb.load_dataset("small_table"))
pb.Validate(data="a")
.col_vals_not_null(columns
.interrogate() )
STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | E | C | EXT | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
#4CA64C | 1 |
|
✓ | 13 | 13 1.00 |
0 0.00 |
— | — | — | — |
Validation of Table Structure
Schema validation can be performed with the col_schema_match()
validation method and the goal is to check whether the structure of a table matches an expected schema. To define an expected table schema, we need to use the Schema
class. Here is a simple example that (1) prepares a schema consisting of column names, (2) using that schema
object in a col_schema_match()
validation step:
= pb.Schema(columns=["date_time", "date", "a", "b", "c", "d", "e", "f"])
schema
(=pb.load_dataset("small_table"))
pb.Validate(data=schema)
.col_schema_match(schema
.interrogate() )
STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | E | C | EXT | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
#4CA64C | 1 |
|
✓ | 1 | 1 1.00 |
0 0.00 |
— | — | — | — |
Any col_schema_match()
validation step will only have a single test unit (signifying pass or fail). We can see in the above validation report that the column schema validation passed.
More often a schema will be defined using column names and column types. We can do that by using a list of tuples in the columns=
parameter of Schema
. Here’s an example of that approach in action:
= pb.Schema(
schema =[
columns"date_time", "Datetime(time_unit='us', time_zone=None)"),
("date", "Date"),
("a", "Int64"),
("b", "String"),
("c", "Int64"),
("d", "Float64"),
("e", "Boolean"),
("f", "String"),
(
]
)
(=pb.load_dataset("small_table"))
pb.Validate(data=schema)
.col_schema_match(schema
.interrogate() )
STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | E | C | EXT | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
#4CA64C | 1 |
|
✓ | 1 | 1 1.00 |
0 0.00 |
— | — | — | — |
The col_schema_match()
validation method has several boolean parameters for making the checks less stringent:
complete=
in_order=
case_sensitive_colnames=
case_sensitive_dtypes=
All of these are True
by default but modifying these settings can make for more convenient checks when absolute strictness isn’t practical.
Should you need to check for the presence of individual columns (regardless of type), the col_exists()
validation method is useful. In this example, we check whether the date
column is present in the table:
(=pb.load_dataset("small_table"))
pb.Validate(data="date")
.col_exists(columns
.interrogate() )
STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | E | C | EXT | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
#4CA64C | 1 |
|
✓ | 1 | 1 1.00 |
0 0.00 |
— | — | — | — |
That column is present, so the single test unit of this validation step is a passing one.
Row and Column Count Validations
Row and column count validations check the number of rows and columns in a table.
Using row_count_match()
checks whether the number of rows in a table matches a specified count.
(=pb.load_dataset("small_table"))
pb.Validate(data=13)
.row_count_match(count
.interrogate() )
STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | E | C | EXT | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
#4CA64C | 1 |
|
✓ | 1 | 1 1.00 |
0 0.00 |
— | — | — | — |
The col_count_match()
validation method checks if the number of columns in a table matches a specified count.
(=pb.load_dataset("small_table"))
pb.Validate(data=8)
.col_count_match(count
.interrogate() )
STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | E | C | EXT | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
#4CA64C | 1 |
|
✓ | 1 | 1 1.00 |
0 0.00 |
— | — | — | — |
Distinct Row Validations
We can check for duplicate rows in a table with rows_distinct()
.
(=pb.load_dataset("small_table"))
pb.Validate(data
.rows_distinct()
.interrogate() )
STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | E | C | EXT | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
#4CA64C66 | 1 |
|
✓ | 13 | 11 0.85 |
2 0.15 |
— | — | — | — |
We can also adapt the rows_distinct()
check to use a single column or a subset of columns. To do that, we need to use the columns_subset=
parameter. Here’s an example of that:
(=pb.load_dataset("small_table"))
pb.Validate(data="b")
.rows_distinct(columns_subset
.interrogate() )
STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | E | C | EXT | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
#4CA64C66 | 1 |
|
✓ | 13 | 11 0.85 |
2 0.15 |
— | — | — | — |
Checking Strings in a Column with a Regular Expression
A regular expression (regex) validation via the col_vals_regex()
validation method checks if values in a column match a specified pattern. Here’s an example with two validation steps:
(=pb.load_dataset("small_table"))
pb.Validate(data="b", pattern=r"^\d-[a-z]{3}-\d{3}$")
.col_vals_regex(columns="f", pattern=r"high|low|mid")
.col_vals_regex(columns
.interrogate() )
STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | E | C | EXT | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
#4CA64C | 1 |
|
✓ | 13 | 13 1.00 |
0 0.00 |
— | — | — | — | |||
#4CA64C | 2 |
|
✓ | 13 | 13 1.00 |
0 0.00 |
— | — | — | — |
Both of these regular expression checks resulted in complete matches down columns b
and f
.
More Information
These are just a few examples of the many validation methods available in Pointblank. For more detailed information, check out the individual reference pages in the API Reference.