col_summary_tbl

col_summary_tbl(data, tbl_name=None)

Generate a column-level summary table of a dataset.

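The col_summary_tbl() function generates a summary table of a dataset, focusing on column-level information. For each column, the summary reports its name and data type, the count and proportion of missing (NA) and unique (UQ) values, and, where applicable, the mean, standard deviation (SD), minimum, 5th percentile (P5), first quartile (Q1), median (Med), third quartile (Q3), 95th percentile (P95), maximum, and interquartile range (IQR).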

The summary table is returned as a GT object, which can be displayed in a notebook or saved to an HTML file.

Warning

The col_summary_tbl() function is still experimental. Please report any issues you encounter in the Pointblank issue tracker.

Parameters

data : FrameT | Any

The table to summarize, which can be a DataFrame object or an Ibis table object. See the Supported Input Table Types section for details on the supported table types.

tbl_name : str | None = None

An optional name for the table, provided as tbl_name=. When given, the name is shown in the header of the summary table.

Returns

: GT

A GT object that displays the column-level summaries of the table.

Supported Input Table Types

The data= parameter can be given any of the following table types:

  • Polars DataFrame ("polars")
  • Pandas DataFrame ("pandas")
  • DuckDB table ("duckdb")*
  • MySQL table ("mysql")*
  • PostgreSQL table ("postgresql")*
  • SQLite table ("sqlite")*
  • Parquet table ("parquet")*

The table types marked with an asterisk need to be prepared as Ibis tables (of type ibis.expr.types.relations.Table). Using col_summary_tbl() with these table types also requires the Ibis library (v9.5.0 or above) to be installed. If the input table is a Polars or Pandas DataFrame, Ibis is not required.

Examples

It’s easy to get a column-level summary of a table using the col_summary_tbl() function. Here’s an example using the small_table dataset (itself loaded using the load_dataset() function):

import pointblank as pb

small_table_polars = pb.load_dataset(dataset="small_table", tbl_type="polars")

pb.col_summary_tbl(data=small_table_polars)
Polars  Rows 13  Columns 8

#   Type     Column     Dtype                                     NA        UQ              Mean     SD       Min    P5    Q1    Med   Q3    P95   Max    IQR
1   date     date_time  Datetime(time_unit='us', time_zone=None)  0 (0.00)  12 (0.92)       2016-01-04 00:32:00 – 2016-01-30 11:23:00
2   date     date       Date                                      0 (0.00)  11 (0.85)       2016-01-04 – 2016-01-30
3   numeric  a          Int64                                     0 (0.00)  7 (0.54)        3.77     2.09     1.00   1.60  2.00  3.00  4.00  7.40  8.00   2.00
4   string   b          String                                    0 (0.00)  12 (0.92)       9.00 SL  0.00 SL  9 SL               9 SL              9 SL
5   numeric  c          Int64                                     2 (0.15)  6 (0.46)        5.73     2.72     2.00   2.50  3.00  7.00  8.00  9.00  9.00   5.00
6   numeric  d          Float64                                   0 (0.00)  12 (0.92)       2305     2631     108    214   838   1036  3291  6335  10000  2453
7   boolean  e          Boolean                                   0 (0.00)  T 0.61, F 0.39
8   string   f          String                                    0 (0.00)  3 (0.23)        3.46 SL  0.52 SL  3 SL               3 SL              4 SL
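The NA and UQ entries in the output above pair a count with a proportion of the total row count (for column c, 2 missing values out of 13 rows gives 0.15). The sketch below reproduces that arithmetic in plain Python. It is an illustration of the figures shown, not pointblank's implementation: the helper count_and_proportion() is hypothetical, the values are made up so that the counts match those reported for column c, and excluding missing values from the unique count is an assumption that happens to agree with the numbers shown.

```python
def count_and_proportion(values):
    """Return (count, proportion) pairs for missing and unique values."""
    n_rows = len(values)
    # Missing values are represented as None in this sketch
    n_missing = sum(v is None for v in values)
    # Assumption: missing values do not count toward the number of unique values
    n_unique = len({v for v in values if v is not None})
    return {
        "NA": (n_missing, round(n_missing / n_rows, 2)),
        "UQ": (n_unique, round(n_unique / n_rows, 2)),
    }


# Illustrative 13-row column with 2 missing values and 6 distinct values
col = [3, 8, 3, None, 7, 4, 3, 2, 9, 9, 7, 8, None]

result = count_and_proportion(col)
print(result)  # {'NA': (2, 0.15), 'UQ': (6, 0.46)}
```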

The table used above was a Polars DataFrame, but the col_summary_tbl() function works with any table type supported by pointblank, including Pandas DataFrames and Ibis backend tables. Here’s an example using a DuckDB table handled by Ibis:

small_table_duckdb = pb.load_dataset(dataset="nycflights", tbl_type="duckdb")

pb.col_summary_tbl(data=small_table_duckdb, tbl_name="nycflights")
DuckDB  nycflights  Rows 336,776  Columns 18

#   Type     Column          Dtype   NA            UQ            Mean     SD       Min    P5     Q1     Med    Q3    P95   Max   IQR
1   numeric  year            int64   0 (0.00)      1 (<0.01)     2013     0.00     2013   2013   2013   2013   2013  2013  2013  0
2   numeric  month           int64   0 (0.00)      12 (<0.01)    6.55     3.41     1.00   1.00   4.00   7.00   9.28  12.0  12.0  6.00
3   numeric  day             int64   0 (0.00)      31 (<0.01)    15.7     8.77     1.00   2.00   8.06   16.0   23.0  29.4  31.0  15.0
4   numeric  dep_time        int64   8255 (0.02)   1317 (<0.01)  1349     488      1.00   623    905    1401   1744  2111  2400  837
5   numeric  sched_dep_time  int64   0 (0.00)      1021 (<0.01)  1344     467      106    630    909    1359   1729  2051  2359  823
6   numeric  dep_delay       int64   8255 (0.02)   526 (<0.01)   12.6     40.2     −43.0  −9.00  −5.00  −2.00  10.8  88.3  1301  16.0
7   numeric  arr_time        int64   8713 (0.03)   1410 (<0.01)  1502     533      1.00   734    1096   1535   1941  2248  2400  836
8   numeric  sched_arr_time  int64   0 (0.00)      1163 (<0.01)  1536     497      1.00   816    1124   1556   1945  2247  2359  821
9   numeric  arr_delay       int64   9430 (0.03)   576 (<0.01)   6.90     44.6     −86.0  −32.2  −16.8  −5.00  13.8  91.2  1272  31.0
10  string   carrier         string  0 (0.00)      16 (<0.01)    2.00 SL  0.00 SL  2 SL                 2 SL                2 SL
11  numeric  flight          int64   0 (0.00)      3844 (0.01)   1972     1632     1.00   93.7   558    1496   3465  4703  8500  2912
12  string   tailnum         string  2512 (<0.01)  4042 (0.01)   6.00 SL  0.07 SL  5 SL                 6 SL                6 SL
13  string   origin          string  0 (0.00)      3 (<0.01)     3.00 SL  0.00 SL  3 SL                 3 SL                3 SL
14  string   dest            string  0 (0.00)      105 (<0.01)   3.00 SL  0.00 SL  3 SL                 3 SL                3 SL
15  numeric  air_time        int64   9430 (0.03)   508 (<0.01)   151      93.7     20.0   40.0   82.3   129    191   339   695   110
16  numeric  distance        int64   0 (0.00)      214 (<0.01)   1040     733      17.0   198    509    872    1389  2477  4983  887
17  numeric  hour            int64   0 (0.00)      20 (<0.01)    13.2     4.66     1      6      9      13     17    20    23    8
18  numeric  minute          int64   0 (0.00)      60 (<0.01)    26.2     19.3     0.00   0.00   7.94   29.0   43.6  58.0  59.0  36.0