DataScan
DataScan(self, data, tbl_name=None)
Get a summary of a dataset.
The DataScan class provides a way to get a summary of a dataset. The summary includes the following information:
- the name of the table (if provided)
- the type of the table (e.g., "polars", "pandas", etc.)
- the number of rows and columns in the table
- column-level information, including:
    - the column name
    - the column type
    - measures of missingness and distinctness
    - measures of negative, zero, and positive values (for numerical columns)
    - a sample of the data (the first 5 values)
    - statistics (if the column contains numbers, strings, or datetimes)
To obtain a dictionary representation of the summary, use the to_dict() method. To get a JSON representation of the summary, use the to_json() method. To save the JSON text to a file, use the save_to_json() method.
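The relationship between the three output forms (dictionary, JSON text, JSON file) can be pictured with the standard library alone. This is only a sketch: the `summary` dictionary below is a hypothetical stand-in with made-up keys and values, not the actual structure returned by to_dict(), and the file name is arbitrary.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical stand-in for a summary dictionary (NOT the real to_dict() layout).
summary = {"tbl_name": "small_table", "n_rows": 13, "n_columns": 8}

# Dictionary -> JSON text (the role to_json() plays for a real summary).
json_text = json.dumps(summary, indent=2)

# JSON text -> file on disk (the role save_to_json() plays for a real summary).
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "datascan_summary.json"
    out_path.write_text(json_text, encoding="utf-8")
    # Reading the file back recovers the original dictionary.
    round_trip = json.loads(out_path.read_text(encoding="utf-8"))
```

Because the JSON text is a plain serialization of the dictionary, loading the saved file gives back the same keys and values.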
The DataScan class is still experimental. Please report any issues you encounter in the Pointblank issue tracker.
Parameters
data : FrameT | Any
    The data to scan and summarize.
tbl_name : str | None = None
    Optionally, the name of the table can be provided as tbl_name.
Measures of Missingness and Distinctness
For each column, the following measures are provided:
- n_missing_values: the number of missing values in the column
- f_missing_values: the fraction of missing values in the column
- n_unique_values: the number of unique values in the column
- f_unique_values: the fraction of unique values in the column
Each fraction is the corresponding count divided by the total number of rows in the dataset.
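As a rough illustration of how these four measures relate, the sketch below computes them for one column held in a plain Python list. The helper name `missing_and_unique` is hypothetical, missing values are assumed to be represented as `None`, and `None` is assumed not to count as a unique value. Pointblank's own handling may differ.

```python
def missing_and_unique(values):
    """Count missing and unique values; fractions use the total row count."""
    n_rows = len(values)
    n_missing = sum(1 for v in values if v is None)
    # Assumption: None is not counted as a distinct value.
    n_unique = len({v for v in values if v is not None})
    return {
        "n_missing_values": n_missing,
        "f_missing_values": n_missing / n_rows if n_rows else 0.0,
        "n_unique_values": n_unique,
        "f_unique_values": n_unique / n_rows if n_rows else 0.0,
    }


column = ["a", "b", None, "a"]
measures = missing_and_unique(column)
print(measures)
```

For this four-row column there is one missing value (fraction 0.25) and two unique values (fraction 0.5).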
Counts and Fractions of Negative, Zero, and Positive Values
For numerical columns, the following measures are provided:
- n_negative_values: the number of negative values in the column
- f_negative_values: the fraction of negative values in the column
- n_zero_values: the number of zero values in the column
- f_zero_values: the fraction of zero values in the column
- n_positive_values: the number of positive values in the column
- f_positive_values: the fraction of positive values in the column
Each fraction is the corresponding count divided by the total number of rows in the dataset.
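The sign-based measures can be sketched the same way for a numerical column stored as a Python list. The helper name `sign_counts` is hypothetical, and treating `None` as a missing value that belongs to none of the three groups is an assumption rather than documented Pointblank behavior.

```python
def sign_counts(values):
    """Count negative, zero, and positive values; fractions use all rows."""
    n_rows = len(values)
    # Assumption: missing values (None) fall into none of the three groups.
    present = [v for v in values if v is not None]
    counts = {
        "negative": sum(1 for v in present if v < 0),
        "zero": sum(1 for v in present if v == 0),
        "positive": sum(1 for v in present if v > 0),
    }
    result = {}
    for name, n in counts.items():
        result[f"n_{name}_values"] = n
        result[f"f_{name}_values"] = n / n_rows if n_rows else 0.0
    return result


signs = sign_counts([-2, 0, 0, 5, 7, None, 3, -1])
print(signs)
```

Since the denominator is the total row count, the three fractions sum to less than 1 whenever the column has missing values (here 0.25 + 0.25 + 0.375 = 0.875).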
Statistics for Numerical Columns
For numerical columns, the following descriptive statistics are provided:
- mean: the mean of the column
- std_dev: the standard deviation of the column
Additionally, the following quantile-based statistics are provided:
- min: the minimum value in the column
- p05: the 5th percentile of the column
- q_1: the first quartile of the column
- med: the median of the column
- q_3: the third quartile of the column
- p95: the 95th percentile of the column
- max: the maximum value in the column
- iqr: the interquartile range of the column
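To make the statistic names concrete, here is a minimal sketch that computes them with Python's standard `statistics` module. The helper name `numeric_summary` is hypothetical. The choice of the sample standard deviation and of the "inclusive" quantile interpolation method are assumptions; the source does not say which conventions DataScan uses, so its results may differ slightly.

```python
import statistics


def numeric_summary(values):
    """Descriptive statistics for a list of at least two numbers."""
    data = sorted(values)
    # Three cut points that split the data into quarters.
    q_1, med, q_3 = statistics.quantiles(data, n=4, method="inclusive")
    # Nineteen cut points at 5% steps; the first is p05, the last is p95.
    twentieths = statistics.quantiles(data, n=20, method="inclusive")
    return {
        "mean": statistics.mean(data),
        "std_dev": statistics.stdev(data),  # assumption: sample std. deviation
        "min": data[0],
        "p05": twentieths[0],
        "q_1": q_1,
        "med": med,
        "q_3": q_3,
        "p95": twentieths[-1],
        "max": data[-1],
        "iqr": q_3 - q_1,  # interquartile range: third minus first quartile
    }


stats = numeric_summary([1, 2, 3, 4, 5])
print(stats)
```

Whatever quantile convention is used, `iqr` is the difference between `q_3` and `q_1`, so it can be checked against those two values.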
Statistics for String Columns
For string columns, the following statistics are provided:
- mode: the mode of the column
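The mode is the most frequently occurring value. A minimal sketch for a string column follows; the helper name `string_mode` is hypothetical, and both the skipping of `None` and the tie-breaking rule (the first value encountered wins) are assumptions, not documented DataScan behavior.

```python
from collections import Counter


def string_mode(values):
    """Return the most frequent non-missing value, or None if there is none."""
    counts = Counter(v for v in values if v is not None)
    if not counts:
        return None
    # most_common(1) returns a one-element list of (value, count) pairs.
    return counts.most_common(1)[0][0]


mode = string_mode(["red", "blue", "red", None, "green"])
print(mode)  # red
```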
Statistics for Datetime Columns
For datetime columns, the following statistics are provided:
- min_date: the minimum date in the column
- max_date: the maximum date in the column
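For a datetime column, these two statistics are the earliest and latest values. The sketch below uses a plain list of `datetime` objects; the helper name `date_range` is hypothetical, and ignoring `None` entries is an assumption.

```python
from datetime import datetime


def date_range(values):
    """Return the earliest and latest non-missing datetimes in a column."""
    present = [v for v in values if v is not None]
    if not present:
        return {"min_date": None, "max_date": None}
    return {"min_date": min(present), "max_date": max(present)}


stamps = [
    datetime(2024, 3, 1, 12, 0),
    None,
    datetime(2023, 11, 15, 8, 30),
    datetime(2024, 1, 9),
]
dates = date_range(stamps)
print(dates)
```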
Returns
DataScan
    A DataScan object.