Generate a column-level summary table of a dataset.
USAGE
col_summary_tbl(data, tbl_name=None)
The col_summary_tbl() function generates a summary table of a dataset that focuses on column-level information. The summary includes the following:
- the type of the table (e.g., "polars", "pandas", etc.)
- the number of rows and columns in the table
- column-level information, including:
  - the column name
  - the column type
  - measures of missingness and distinctness
  - descriptive stats and quantiles
  - statistics for datetime columns
The summary table is returned as a GT object, which can be displayed in a notebook or saved to an HTML file.
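To make the column-level measures listed above concrete, here is a minimal sketch in plain Python of how missingness, distinctness, and a few descriptive statistics could be computed for one numeric column. It is an illustration only, not pointblank's implementation: summarize_column is a hypothetical helper, the sample values are made up, counting None as its own distinct value is an assumption, and the quantile method may differ from the one the library uses.

```python
import statistics


def summarize_column(values):
    """Hypothetical helper: column-level measures for one numeric column."""
    n_rows = len(values)
    present = [v for v in values if v is not None]
    n_missing = n_rows - len(present)
    # Assumption: a missing value counts as one distinct value.
    n_distinct = len(set(values))
    # Quartiles of the non-missing values (inclusive method, an assumption).
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    return {
        "na": (n_missing, round(n_missing / n_rows, 2)),    # count, proportion
        "uq": (n_distinct, round(n_distinct / n_rows, 2)),  # count, proportion
        "mean": round(statistics.mean(present), 2),
        "sd": round(statistics.stdev(present), 2),
        "min": min(present),
        "q1": q1,
        "med": median,
        "q3": q3,
        "max": max(present),
        "iqr": q3 - q1,
    }


# Made-up sample column with two missing values.
summary = summarize_column([2, 4, None, 4, 6, 8, None, 10])
```

Each pair under "na" and "uq" holds a count followed by a proportion of all rows, which mirrors how the NA and UQ cells are laid out in the example tables on this page.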
Parameters
data
The table to summarize, which can be a DataFrame object, an Ibis table object, a CSV file path, a Parquet file path, or a database connection string. See the Supported Input Table Types section for details on the supported table types.
tbl_name: str | None = None
An optional name for the table, provided as tbl_name=.
Returns
GT
A GT object that displays the column-level summaries of the table.
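Because the return value is a GT object, it can be written out as HTML. The following is a minimal sketch, assuming the object exposes Great Tables' as_raw_html() method; save_summary_html is a hypothetical helper name, not part of pointblank.

```python
from pathlib import Path


def save_summary_html(summary, path):
    """Hypothetical helper: write a GT-like object to an HTML file.

    `summary` is assumed to provide an `as_raw_html()` method returning a string,
    as Great Tables' GT objects do.
    """
    out = Path(path)
    out.write_text(summary.as_raw_html(), encoding="utf-8")
    return out
```

With pointblank installed, a call might look like save_summary_html(pb.col_summary_tbl(data=small_table), "summary.html"), where small_table is any supported input table.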
Supported Input Table Types
The data= parameter can be given any of the following table types:
- Polars DataFrame ("polars")
- Pandas DataFrame ("pandas")
- DuckDB table ("duckdb")*
- MySQL table ("mysql")*
- PostgreSQL table ("postgresql")*
- SQLite table ("sqlite")*
- Parquet table ("parquet")*
- CSV files (string path or pathlib.Path object with .csv extension)
- Parquet files (string path, pathlib.Path object, glob pattern, directory with .parquet extension, or partitioned dataset)
- GitHub URLs (direct links to CSV or Parquet files on GitHub)
- Database connection strings (URI format with optional table specification)
The table types marked with an asterisk need to be prepared as Ibis tables (of type ibis.expr.types.relations.Table). Using col_summary_tbl() with these table types also requires the Ibis library (v9.5.0 or above) to be installed. If the input table is a Polars or Pandas DataFrame, Ibis is not required.
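As an illustration of the file-path inputs described above, the sketch below writes a small, made-up CSV file and hands its path to col_summary_tbl(). The helper names write_demo_csv and summarize_csv are hypothetical, and the sketch assumes pointblank plus a DataFrame library able to read the CSV are installed when summarize_csv() is called.

```python
import csv
import tempfile
from pathlib import Path


def write_demo_csv(directory=None):
    """Write a tiny, made-up CSV file and return its path."""
    base = Path(directory) if directory is not None else Path(tempfile.mkdtemp())
    path = base / "demo.csv"
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["item", "price"])
        # The empty string leaves the last price missing.
        writer.writerows([["apple", 1.25], ["pear", 0.8], ["fig", ""]])
    return path


def summarize_csv(path):
    """Pass a CSV path (str or pathlib.Path) directly as data=."""
    # Imported here so the sketch can be defined without pointblank installed.
    import pointblank as pb

    return pb.col_summary_tbl(data=path, tbl_name=Path(path).name)
```

Calling summarize_csv(write_demo_csv()) should then return the GT summary for the three-row file.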
Examples
It’s easy to get a column-level summary of a table using the col_summary_tbl() function. Here’s an example using the small_table dataset (itself loaded using the load_dataset() function):
import pointblank as pb

small_table = pb.load_dataset(dataset="small_table", tbl_type="polars")
pb.col_summary_tbl(data=small_table)
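Polars; Rows: 13; Columns: 8

| Column | Type | NA | UQ | Mean | SD | Min | P5 | Q1 | Med | Q3 | P95 | Max | IQR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| date_time | Datetime(time_unit='us', time_zone=None) | 0 (0) | 12 (0.92) | - | - | 2016 01 04 00:32:00 | - | - | - | - | - | 2016 01 30 11:23:00 | - |
| date | Date | 0 (0) | 11 (0.85) | - | - | 2016 01 04 | - | - | - | - | - | 2016 01 30 | - |
| a | Int64 | 0 (0) | 7 (0.54) | 3.77 | 2.09 | 1 | 1.06 | 2 | 3 | 4 | 7.4 | 8 | 2 |
| b | String | 0 (0) | 12 (0.92) | 9 | 0 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 0 |
| c | Int64 | 2 (0.15) | 7 (0.54) | 5.73 | 2.72 | 2 | 2.05 | 3 | 7 | 8 | 9 | 9 | 5 |
| d | Float64 | 0 (0) | 12 (0.92) | 2,304.7 | 2,631.36 | 108.34 | 118.88 | 837.93 | 1,035.64 | 3,291.03 | 6,335.44 | 9999.99 | 2,453.1 |
| e | Boolean | 0 (0) | T 0.62, F 0.38 | - | - | - | - | - | - | - | - | - | - |
| f | String | 0 (0) | 3 (0.23) | 3.46 | 0.52 | 3 | 3 | 3 | 3 | 4 | 4 | 4 | 1 |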
Statistics for String columns describe string lengths.
The table used above was a Polars DataFrame, but the col_summary_tbl() function works with any table supported by pointblank, including Pandas DataFrames and Ibis backend tables. Here’s an example using a DuckDB table handled by Ibis:
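A minimal sketch of such a call follows. It assumes the nycflights dataset available through load_dataset() (its shape matches the 336,776 rows and 18 columns summarized next) and that Ibis with DuckDB support is installed. The wrapper function summarize_duckdb_flights is a hypothetical name used only to keep the sketch self-contained.

```python
def summarize_duckdb_flights():
    """Summarize a DuckDB-backed table that pointblank handles through Ibis."""
    # Imported here so the sketch can be defined without pointblank installed.
    import pointblank as pb

    # Assumption: "nycflights" is the dataset behind the summary shown below.
    nycflights = pb.load_dataset(dataset="nycflights", tbl_type="duckdb")
    return pb.col_summary_tbl(data=nycflights, tbl_name="nycflights")
```

In a notebook, evaluating summarize_duckdb_flights() in a cell would display the returned GT object.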
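Rows: 336,776; Columns: 18

| Column | Type | NA | UQ | Mean | SD | Min | P5 | Q1 | Med | Q3 | P95 | Max | IQR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| year | Int64 | 0 (0) | 1 (<.01) | 2,013 | 0 | 2,013 | 2,013 | 2,013 | 2,013 | 2,013 | 2,013 | 2,013 | 0 |
| month | Int64 | 0 (0) | 12 (<.01) | 6.55 | 3.41 | 1 | 1 | 4 | 6.77 | 10 | 12 | 12 | 6 |
| day | Int64 | 0 (0) | 31 (<.01) | 15.71 | 8.77 | 1 | 1 | 8 | 15.64 | 23 | 29 | 31 | 15 |
| dep_time | Int64 | 8255 (0.02) | 1319 (<.01) | 1,349.11 | 488.28 | 1 | 514 | 907 | 1,381.15 | 1,744 | 2,112 | 2,400 | 837 |
| sched_dep_time | Int64 | 0 (0) | 1021 (<.01) | 1,344.25 | 467.34 | 106 | 545 | 906 | 1,377.13 | 1,729 | 2,050 | 2,359 | 823 |
| dep_delay | Int64 | 8255 (0.02) | 528 (<.01) | 12.64 | 40.21 | −43 | −13 | −5 | −1.58 | 11 | 88 | 1,301 | 16 |
| arr_time | Int64 | 8713 (0.03) | 1412 (<.01) | 1,502.05 | 533.26 | 1 | 10 | 1,104 | 1,539.25 | 1,940 | 2,248 | 2,400 | 836 |
| sched_arr_time | Int64 | 0 (0) | 1163 (<.01) | 1,536.38 | 497.46 | 1 | 15 | 1,124 | 1,575.93 | 1,945 | 2,246 | 2,359 | 821 |
| arr_delay | Int64 | 9430 (0.03) | 578 (<.01) | 6.9 | 44.63 | −86 | −48 | −17 | −4.92 | 14 | 91 | 1,272 | 31 |
| carrier | String | 0 (0) | 16 (<.01) | 2 | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0 |
| flight | Int64 | 0 (0) | 3844 (0.01) | 1,971.92 | 1,632.47 | 1 | 4 | 553 | 1,499.04 | 3,465 | 4,695 | 8,500 | 2,912 |
| tailnum | String | 2512 (<.01) | 4044 (0.01) | 6 | 0.07 | 5 | 6 | 6 | 6 | 6 | 6 | 6 | 0 |
| origin | String | 0 (0) | 3 (<.01) | 3 | 0 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 0 |
| dest | String | 0 (0) | 105 (<.01) | 3 | 0 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 0 |
| air_time | Int64 | 9430 (0.03) | 510 (<.01) | 150.69 | 93.69 | 20 | 31 | 82 | 129.35 | 192 | 339 | 695 | 110 |
| distance | Int64 | 0 (0) | 214 (<.01) | 1,039.91 | 733.23 | 17 | 116 | 502 | 861.05 | 1,389 | 2,475 | 4,983 | 887 |
| hour | Int64 | 0 (0) | 20 (<.01) | 13.18 | 4.66 | 1 | 5 | 9 | 13.4 | 17 | 20 | 23 | 8 |
| minute | Int64 | 0 (0) | 60 (<.01) | 26.23 | 19.3 | 0 | 0 | 8 | 28.3 | 44 | 58 | 59 | 36 |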
Statistics for String columns describe string lengths.
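The footnote means that, for a String column, values such as Mean, Min, and Max are taken over the lengths of the strings rather than the strings themselves. A minimal sketch of that idea in plain Python, using made-up values and not pointblank's own code:

```python
import statistics

# Made-up string column for illustration.
labels = ["high", "low", "mid", "high"]

# Summary statistics are computed on the string lengths, not the strings.
lengths = [len(label) for label in labels]
length_stats = {
    "mean": statistics.mean(lengths),
    "min": min(lengths),
    "max": max(lengths),
}
```

This is why a column of fixed-width codes, such as carrier in the second table, shows the same value for every quantile and an SD of 0.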