Sundering Data

Sundering data? First off, let’s get the correct meaning across here. Sundering is really just splitting, dividing, cutting into two pieces. And it’s a useful thing we can do in Pointblank to any data that we are validating. When you interrogate the data, you learn about which rows have test failures within them. With more validation steps, we get an even better picture of this simply by virtue of more testing.

Let’s use the small_table in our examples to show just how sundering is done. Here’s that table:

Polars DataFrame: 13 rows, 8 columns

     date_time            date        a      b          c      d        e        f
     Datetime             Date        Int64  String     Int64  Float64  Boolean  String
  1  2016-01-04 11:00:00  2016-01-04  2      1-bcd-345  3      3423.29  True     high
  2  2016-01-04 00:32:00  2016-01-04  3      5-egh-163  8      9999.99  True     low
  3  2016-01-05 13:32:00  2016-01-05  6      8-kdg-938  3      2343.23  True     high
  4  2016-01-06 17:23:00  2016-01-06  2      5-jdo-903  None   3892.4   False    mid
  5  2016-01-09 12:36:00  2016-01-09  8      3-ldm-038  7      283.94   True     low
  6  2016-01-11 06:15:00  2016-01-11  4      2-dhe-923  4      3291.03  True     mid
  7  2016-01-15 18:46:00  2016-01-15  7      1-knw-093  3      843.34   True     high
  8  2016-01-17 11:27:00  2016-01-17  4      5-boe-639  2      1035.64  False    low
  9  2016-01-20 04:30:00  2016-01-20  3      5-bce-642  9      837.93   False    high
 10  2016-01-20 04:30:00  2016-01-20  3      5-bce-642  9      837.93   False    high
 11  2016-01-26 20:07:00  2016-01-26  4      2-dmx-010  7      833.98   True     low
 12  2016-01-28 02:51:00  2016-01-28  2      7-dmx-010  8      108.34   False    low
 13  2016-01-30 11:23:00  2016-01-30  1      3-dka-303  None   2230.09  True     high
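
If you'd like to follow along, here's a minimal way to load that same table yourself (load_dataset() returns small_table as a Polars DataFrame, as shown above):

import pointblank as pb

# Load the built-in small_table dataset and print its Polars representation
small_table = pb.load_dataset(dataset="small_table")
print(small_table)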

A Simple Example Where Data is Torn Asunder

We’ll begin with a very simple validation plan, having only a single step. There will be failing test units here.

validation = (
    pb.Validate(data=pb.load_dataset(dataset="small_table"))
    .col_vals_ge(columns="d", value=1000)
    .interrogate()
)

validation
STEP  METHOD         COLUMNS  VALUES  UNITS  PASS      FAIL
1     col_vals_ge()  d        1000    13     7 (0.54)  6 (0.46)

We see six failing test units in the FAIL column of the validation report table above. There is a data extract (a collection of the failing rows) available. Let's use the get_data_extracts() method to have a look at it.

validation.get_data_extracts(i=1, frame=True)
shape: (6, 9)
_row_num_ | date_time           | date       | a   | b           | c   | d      | e     | f
u32       | datetime[μs]        | date       | i64 | str         | i64 | f64    | bool  | str
5         | 2016-01-09 12:36:00 | 2016-01-09 | 8   | "3-ldm-038" | 7   | 283.94 | true  | "low"
7         | 2016-01-15 18:46:00 | 2016-01-15 | 7   | "1-knw-093" | 3   | 843.34 | true  | "high"
9         | 2016-01-20 04:30:00 | 2016-01-20 | 3   | "5-bce-642" | 9   | 837.93 | false | "high"
10        | 2016-01-20 04:30:00 | 2016-01-20 | 3   | "5-bce-642" | 9   | 837.93 | false | "high"
11        | 2016-01-26 20:07:00 | 2016-01-26 | 4   | "2-dmx-010" | 7   | 833.98 | true  | "low"
12        | 2016-01-28 02:51:00 | 2016-01-28 | 2   | "7-dmx-010" | 8   | 108.34 | false | "low"

These are the six rows that had failing test units in column d. Indeed, we can see that every value of d here is less than 1000 (and we asserted that values should be greater than or equal to 1000). This is the 'bad' data, if you will. Using the get_sundered_data() method, we get the 'good' part:

validation.get_sundered_data()
shape: (7, 8)
date_time           | date       | a   | b           | c    | d       | e     | f
datetime[μs]        | date       | i64 | str         | i64  | f64     | bool  | str
2016-01-04 11:00:00 | 2016-01-04 | 2   | "1-bcd-345" | 3    | 3423.29 | true  | "high"
2016-01-04 00:32:00 | 2016-01-04 | 3   | "5-egh-163" | 8    | 9999.99 | true  | "low"
2016-01-05 13:32:00 | 2016-01-05 | 6   | "8-kdg-938" | 3    | 2343.23 | true  | "high"
2016-01-06 17:23:00 | 2016-01-06 | 2   | "5-jdo-903" | null | 3892.4  | false | "mid"
2016-01-11 06:15:00 | 2016-01-11 | 4   | "2-dhe-923" | 4    | 3291.03 | true  | "mid"
2016-01-17 11:27:00 | 2016-01-17 | 4   | "5-boe-639" | 2    | 1035.64 | false | "low"
2016-01-30 11:23:00 | 2016-01-30 | 1   | "3-dka-303" | null | 2230.09 | true  | "high"

This is a Polars DataFrame of seven rows. All values in d were passing test units (i.e., fulfilled the expectation outlined in the validation step) and, in many ways, this is like a ‘good extract’.
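
If you want to verify that claim in code, a quick check on the returned DataFrame works (just a sanity check, assuming the Polars backend used throughout this page):

# All surviving values of "d" meet the original expectation (>= 1000)
good = validation.get_sundered_data()
assert (good["d"] >= 1000).all()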

You can always collect the failing rows with get_sundered_data() by using the type="fail" option. Trying that here:

validation.get_sundered_data(type="fail")
shape: (6, 8)
date_time           | date       | a   | b           | c   | d      | e     | f
datetime[μs]        | date       | i64 | str         | i64 | f64    | bool  | str
2016-01-09 12:36:00 | 2016-01-09 | 8   | "3-ldm-038" | 7   | 283.94 | true  | "low"
2016-01-15 18:46:00 | 2016-01-15 | 7   | "1-knw-093" | 3   | 843.34 | true  | "high"
2016-01-20 04:30:00 | 2016-01-20 | 3   | "5-bce-642" | 9   | 837.93 | false | "high"
2016-01-20 04:30:00 | 2016-01-20 | 3   | "5-bce-642" | 9   | 837.93 | false | "high"
2016-01-26 20:07:00 | 2016-01-26 | 4   | "2-dmx-010" | 7   | 833.98 | true  | "low"
2016-01-28 02:51:00 | 2016-01-28 | 2   | "7-dmx-010" | 8   | 108.34 | false | "low"

gives us the same rows as the DataFrame obtained from validation.get_data_extracts(i=1, frame=True). Two important things to know about the tables returned by get_sundered_data() with type="pass" (the default) and type="fail" are that (both properties are checked in the snippet after this list):

  • the sum of rows across these returned tables will be equal to that of the original table
  • the rows in each split table are mutually exclusive (i.e., you won’t find the same row in both)
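
Here's a quick way to confirm both properties for this example (a sketch, not something you'd normally need to write; it assumes the Polars backend, where .height is the row count):

pass_df = validation.get_sundered_data()                # type="pass" is the default
fail_df = validation.get_sundered_data(type="fail")

# The two splits together account for every row of the original 13-row table
assert pass_df.height + fail_df.height == 13

# And no row lands in both splits: an anti-join of the "pass" rows against the
# "fail" rows (matching on all columns) leaves the "pass" rows untouched
assert pass_df.join(fail_df, on=pass_df.columns, how="anti").height == pass_df.height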

You can think of sundered data as a filtered version of the original dataset based on validation results. While the simple example illustrates how this process works on a basic level, the value of the method is better seen in a slightly more complex example.

Using get_sundered_data() with a More Comprehensive Validation

The previous example used exactly one validation step. You're likely to use more than that in standard practice, so let's see how get_sundered_data() works in those more common situations. Here's a validation with three steps:

validation_2 = (
    pb.Validate(data=pb.load_dataset(dataset="small_table"))
    .col_vals_ge(columns="d", value=1000)
    .col_vals_not_null(columns="c")
    .col_vals_gt(columns="a", value=2)
    .interrogate()
)

validation_2
STEP  METHOD               COLUMNS  VALUES  UNITS  PASS       FAIL
1     col_vals_ge()        d        1000    13     7 (0.54)   6 (0.46)
2     col_vals_not_null()  c                13     11 (0.85)  2 (0.15)
3     col_vals_gt()        a        2       13     9 (0.69)   4 (0.31)

There are quite a few failures here across the three validation steps. In the FAIL column of the validation report table, there are 12 failing test units if we were to tally them up. So if the input table has 13 rows in total, does this mean there would be one row in the table returned by get_sundered_data()? Not so:

validation_2.get_sundered_data()
shape: (4, 8)
date_time           | date       | a   | b           | c   | d       | e     | f
datetime[μs]        | date       | i64 | str         | i64 | f64     | bool  | str
2016-01-04 00:32:00 | 2016-01-04 | 3   | "5-egh-163" | 8   | 9999.99 | true  | "low"
2016-01-05 13:32:00 | 2016-01-05 | 6   | "8-kdg-938" | 3   | 2343.23 | true  | "high"
2016-01-11 06:15:00 | 2016-01-11 | 4   | "2-dhe-923" | 4   | 3291.03 | true  | "mid"
2016-01-17 11:27:00 | 2016-01-17 | 4   | "5-boe-639" | 2   | 1035.64 | false | "low"

There are four rows. This is because the different validation steps tested values in different columns of the table, and some rows had failing test units in more than one of those steps. The rows that didn't have any failing test units across the three steps (in three different columns) are the ones seen above. This brings us to the third important thing about the sundering process:

  • a row is returned in the "pass" set only if it has no test-unit failures across all validation steps; all other rows are placed in the "fail" set (the sketch below illustrates this)
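
To see why only four rows survive, we can rebuild the "pass" set by hand with ordinary Polars filters that mirror the three validation steps (a rough equivalent of what sundering produces for this particular plan, not how Pointblank implements it internally):

import polars as pl

tbl = pb.load_dataset(dataset="small_table")

# Keep only rows satisfying every row-level expectation from validation_2
manual_pass = tbl.filter(
    (pl.col("d") >= 1000)          # step 1: col_vals_ge(columns="d", value=1000)
    & pl.col("c").is_not_null()    # step 2: col_vals_not_null(columns="c")
    & (pl.col("a") > 2)            # step 3: col_vals_gt(columns="a", value=2)
)

# Same four rows as validation_2.get_sundered_data()
print(manual_pass)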

In validations where many validation steps are used, we can be more confident about the level of data quality for those rows returned in the "pass" set. But not every type of validation step is considered within this splitting procedure. The next section will explain the rules on that.

The Validation Methods Considered When Sundering

The sundering procedure relies on row-level validation methods being used. This makes sense, as it's impossible to judge the quality of a row when using the col_exists() validation method, for example. Luckily, we have many row-level validation methods; they are the same set of validation methods that are considered when creating data extracts.

There are some additional caveats, though. Even when using a validation method from that row-level set, the validation step won't be used for sundering if:

  • the active= parameter for that step has been set to False
  • the pre= parameter has been used

The first one makes intuitive sense (you decided to skip this validation step entirely), but the second one requires some explanation. Using pre= allows you to modify the target table before validation, and there's no easy or practical way to map rows in a mutated table back to rows in the original table (e.g., a mutation may drastically reduce the number of rows).
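
As a small sketch of the first caveat, a step marked with active=False is part of the plan but contributes nothing to the split (the steps here simply reuse ones from the earlier examples):

validation_3 = (
    pb.Validate(data=pb.load_dataset(dataset="small_table"))
    .col_vals_ge(columns="d", value=1000)                 # considered for sundering
    .col_vals_gt(columns="a", value=2, active=False)      # skipped: step is inactive
    .interrogate()
)

# Only the first step contributes, so this split should match the
# single-step example from earlier on this page
validation_3.get_sundered_data()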

So long as you’re aware of the rules and limitations of sundering, you’ll hopefully find it to be a simple and useful way to filter your input table on the basis of a validation plan.
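
Putting it all together, a typical pattern is to keep only the 'pass' rows for downstream work (the output filename below is just a placeholder):

clean = validation_2.get_sundered_data()            # rows that passed every step
clean.write_parquet("small_table_clean.parquet")    # hypothetical downstream step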