Get the rows that failed for each validation step.
Validate.get_data_extracts(
    i=None,
    frame=False,
)
After the interrogate() method has been called, get_data_extracts() can be used to extract the rows that failed in each column-value or row-based validation step (e.g., col_vals_gt() or rows_distinct()). The method returns a dictionary of tables, keyed by step number, where each table contains the rows that failed in that step. If frame=True and i= is a scalar, the single table is returned directly instead of a dictionary.
Parameters
i: int | list[int] | None = None
-
The validation step number(s) from which the failed rows are obtained. Can be provided as a list of integers or a single integer. If None, all steps are included.
frame: bool = False
-
If True and i= is a scalar, return the value as a DataFrame instead of a dictionary.
Returns
dict[int, Any] | Any
-
A dictionary of tables containing the rows that failed in every compatible validation step. Alternatively, it can be a DataFrame if frame=True and i= is a scalar.
Examples
Let’s perform a series of validation steps on a Polars DataFrame. We’ll use col_vals_gt() in the first step, col_vals_lt() in the second step, and col_vals_ge() in the third step. The interrogate() method executes the validation; afterward, we can extract the rows that failed in each validation step.
import pointblank as pb
import polars as pl
tbl = pl.DataFrame(
    {
        "a": [5, 6, 5, 3, 6, 1],
        "b": [1, 2, 1, 5, 2, 6],
        "c": [3, 7, 2, 6, 3, 1],
    }
)
validation = (
    pb.Validate(data=tbl)
    .col_vals_gt(columns="a", value=4)
    .col_vals_lt(columns="c", value=5)
    .col_vals_ge(columns="b", value=1)
    .interrogate()
)
validation.get_data_extracts()
{1: shape: (2, 4)
┌───────────┬─────┬─────┬─────┐
│ _row_num_ ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ i64 │
╞═══════════╪═════╪═════╪═════╡
│ 4 ┆ 3 ┆ 5 ┆ 6 │
│ 6 ┆ 1 ┆ 6 ┆ 1 │
└───────────┴─────┴─────┴─────┘,
2: shape: (2, 4)
┌───────────┬─────┬─────┬─────┐
│ _row_num_ ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ i64 │
╞═══════════╪═════╪═════╪═════╡
│ 2 ┆ 6 ┆ 2 ┆ 7 │
│ 4 ┆ 3 ┆ 5 ┆ 6 │
└───────────┴─────┴─────┴─────┘,
3: shape: (0, 4)
┌───────────┬─────┬─────┬─────┐
│ _row_num_ ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ i64 │
╞═══════════╪═════╪═════╪═════╡
└───────────┴─────┴─────┴─────┘}
The get_data_extracts() method returns a dictionary of tables, keyed by step number. Each table contains the subset of rows from the input table that failed in that validation step, and its _row_num_ column gives each row's 1-based position in the input table.
In the first step, the col_vals_gt() method was used to check if the values in column a were greater than 4. The extracted table shows the rows where this condition was not met; look at the a column: both values (3 and 1) are 4 or less.
In the second step, the col_vals_lt() method was used to check if the values in column c were less than 5. In the extracted two-row table, we see that the values in column c (7 and 6) are not less than 5.
The third step (col_vals_ge()) checked if the values in column b were greater than or equal to 1. There were no failing test units, so the extracted table is empty (i.e., has columns but no rows).
The i= argument can be used to narrow down the extraction to one or more steps. For example, to extract the rows that failed in the first step only:
validation.get_data_extracts(i=1)
{1: shape: (2, 4)
┌───────────┬─────┬─────┬─────┐
│ _row_num_ ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ i64 │
╞═══════════╪═════╪═════╪═════╡
│ 4 ┆ 3 ┆ 5 ┆ 6 │
│ 6 ┆ 1 ┆ 6 ┆ 1 │
└───────────┴─────┴─────┴─────┘}
Note that the first validation step is indexed at 1 (not 0). This 1-based indexing is in place here to match the step numbers reported in the validation table. What we get back is still a dictionary, but it only contains one table (the one for the first step).
If you want to get the extracted table as a DataFrame, set frame=True and provide a scalar value for i. For example, to get the extracted table for the second step as a DataFrame:
pb.preview(validation.get_data_extracts(i=2, frame=True))
The extracted table is now a DataFrame, which is a more convenient format for further analysis or visualization. We also used the preview() function to show the DataFrame in an HTML view.