import pointblank as pb
import polars as pl
= pl.DataFrame(
tbl
{"col_1": ["a", "b", "c", "d"],
"col_2": ["a", "a", "c", "d"],
"col_3": ["a", "a", "d", "e"],
}
)
pb.preview(tbl)
Validate.rows_distinct
Validate.rows_distinct(=None,
columns_subset=None,
pre=None,
thresholds=None,
actions=None,
brief=True,
active )
Validate whether rows in the table are distinct.
The rows_distinct()
method checks whether rows in the table are distinct. This validation will operate over the number of test units that is equal to the number of rows in the table (determined after any pre=
mutation has been applied).
Parameters
columns_subset :
str
|list
[str
] | None = None-
A single column or a list of columns to use as a subset for the distinct comparison. If
None
, then all columns in the table will be used for the comparison. If multiple columns are supplied, the distinct comparison will be made over the combination of values in those columns. pre :
Callable
| None = None-
A optional preprocessing function or lambda to apply to the data table during interrogation.
thresholds :
int
|float
|bool
|tuple
|dict
| Thresholds = None-
Optional failure threshold levels for the validation step(s), so that the interrogation can react accordingly when exceeding the set levels for different states (‘warning’, ‘error’, and ‘critical’). This can be created using the
Thresholds
class or more simply as (1) an integer or float denoting the absolute number or fraction of failing test units for the ‘warn’ level, (2) a tuple of 1-3 values, or (3) a dictionary of 1-3 entries. actions : Actions | None = None
-
Optional actions to take when the validation step meets or exceeds any set threshold levels. If provided, the
Actions
class should be used to define the actions. brief :
str
| None = None-
An optional brief description of the validation step. The templating elements
"{col}"
and"{step}"
can be used to insert the column name and step number, respectively. active :
bool
= True-
A boolean value indicating whether the validation step should be active. Using
False
will make the validation step inactive (still reporting its presence and keeping indexes for the steps unchanged).
Returns
: Validate
-
The
Validate
object with the added validation step.
Examples
For the examples here, we’ll use a simple Polars DataFrame with three string columns (col_1
, col_2
, and col_3
). The table is shown below:
Let’s validate that the rows in the table are distinct with rows_distinct()
. We’ll determine if this validation had any failing test units (there are four test units, one for each row). A failing test units means that a given row is not distinct from every other row.
= (
validation =tbl)
pb.Validate(data
.rows_distinct()
.interrogate()
)
validation
STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | E | C | EXT | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
#4CA64C | 1 |
|
✓ | 4 | 4 1.00 |
0 0.00 |
— | — | — | — |
From this validation table we see that there are no failing test units. All rows in the table are distinct from one another.
We can also use a subset of columns to determine distinctness. Let’s specify the subset using columns col_2
and col_3
for the next validation.
= (
validation =tbl)
pb.Validate(data=["col_2", "col_3"])
.rows_distinct(columns_subset
.interrogate()
)
validation
STEP | COLUMNS | VALUES | TBL | EVAL | UNITS | PASS | FAIL | W | E | C | EXT | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
#4CA64C66 | 1 |
|
✓ | 4 | 2 0.50 |
2 0.50 |
— | — | — | — |
The validation table reports two failing test units. The first and second rows are duplicated when considering only the values in columns col_2
and col_3
. There’s only one set of duplicates but there are two failing test units since each row is compared to all others.