col_summary_tbl() function

Generate a column-level summary table of a dataset.

USAGE

col_summary_tbl(data, tbl_name=None)

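The col_summary_tbl() function generates a summary table that reports column-level information about a dataset. As the example output further down shows, the summary includes the following information:

  • the table type and the number of rows and columns
  • the name and data type of each column
  • the count and proportion of missing values (NA) and of unique values (UQ)
  • the mean and standard deviation (Mean, SD)
  • the minimum, maximum, and selected quantiles (Min, P5, Q1, Med, Q3, P95, Max), plus the interquartile range (IQR)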

The summary table is returned as a GT object, which can be displayed in a notebook or saved to an HTML file.

Warning

The col_summary_tbl() function is still experimental. Please report any issues you encounter in the Pointblank issue tracker.

Parameters

data : FrameT | Any

The table to summarize, which could be a DataFrame object, an Ibis table object, a CSV file path, a Parquet file path, or a database connection string. Read the Supported Input Table Types section for details on the supported table types.

tbl_name : str | None = None

An optional name for the table, provided as tbl_name=.

Returns

GT

A GT object that displays the column-level summaries of the table.

Supported Input Table Types

The data= parameter can be given any of the following table types:

  • Polars DataFrame ("polars")
  • Pandas DataFrame ("pandas")
  • DuckDB table ("duckdb")*
  • MySQL table ("mysql")*
  • PostgreSQL table ("postgresql")*
  • SQLite table ("sqlite")*
  • Parquet table ("parquet")*
  • CSV files (string path or pathlib.Path object with .csv extension)
  • Parquet files (string path, pathlib.Path object, glob pattern, directory with .parquet extension, or partitioned dataset)
  • GitHub URLs (direct links to CSV or Parquet files on GitHub)
  • Database connection strings (URI format with optional table specification)

The table types marked with an asterisk need to be prepared as Ibis tables (of type ibis.expr.types.relations.Table). Using col_summary_tbl() with these table types also requires the Ibis library (v9.5.0 or above) to be installed. If the input table is a Polars or Pandas DataFrame, Ibis is not required.

Examples


It’s easy to get a column-level summary of a table using the col_summary_tbl() function. Here’s an example using the small_table dataset (itself loaded using the load_dataset() function):

import pointblank as pb

small_table = pb.load_dataset(dataset="small_table", tbl_type="polars")

pb.col_summary_tbl(data=small_table)
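Polars   Rows 13   Columns 8

| Type | Column | Data type | NA | UQ | Mean | SD | Min | P5 | Q1 | Med | Q3 | P95 | Max | IQR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| date | date_time | Datetime(time_unit='us', time_zone=None) | 0 (0) | 12 (0.92) | - | - | 2016-01-04 00:32:00 | - | - | - | - | - | 2016-01-30 11:23:00 | - |
| date | date | Date | 0 (0) | 11 (0.85) | - | - | 2016-01-04 | - | - | - | - | - | 2016-01-30 | - |
| numeric | a | Int64 | 0 (0) | 7 (0.54) | 3.77 | 2.09 | 1 | 1.06 | 2 | 3 | 4 | 7.4 | 8 | 2 |
| string | b | String | 0 (0) | 12 (0.92) | 9 | 0 | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 0 |
| numeric | c | Int64 | 2 (0.15) | 7 (0.54) | 5.73 | 2.72 | 2 | 2.05 | 3 | 7 | 8 | 9 | 9 | 5 |
| numeric | d | Float64 | 0 (0) | 12 (0.92) | 2,304.7 | 2,631.36 | 108.34 | 118.88 | 837.93 | 1,035.64 | 3,291.03 | 6,335.44 | 9999.99 | 2,453.1 |
| boolean | e | Boolean | 0 (0) | T 0.62, F 0.38 | - | - | - | - | - | - | - | - | - | - |
| string | f | String | 0 (0) | 3 (0.23) | 3.46 | 0.52 | 3 | 3 | 3 | 3 | 4 | 4 | 4 | 1 |
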
For string columns, the statistics describe string lengths.
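
To make the NA, UQ, and string-length figures concrete, here is a standard-library sketch of how such values could be computed for a single string column. It is not pointblank's implementation. The summarize_string_column() helper and its input values are hypothetical, the use of the sample standard deviation is an assumption, and counting only non-missing values as unique is a convention chosen for this sketch. Proportions are taken relative to the row count, which matches the output above (column c has 2 missing values in 13 rows, shown as 0.15).

```python
import statistics


def summarize_string_column(values):
    """Summarize a list of strings in which None marks a missing value."""
    n_rows = len(values)
    present = [v for v in values if v is not None]
    # Statistics for a string column are computed on string lengths.
    lengths = [len(v) for v in present]
    n_missing = n_rows - len(present)
    # Convention chosen for this sketch: only non-missing values count as unique.
    n_unique = len(set(present))
    return {
        "na_count": n_missing,
        "na_prop": round(n_missing / n_rows, 2),
        "uq_count": n_unique,
        "uq_prop": round(n_unique / n_rows, 2),
        "lengths": lengths,
        "mean": statistics.mean(lengths),
        "sd": statistics.stdev(lengths),  # sample SD (an assumption)
        "min": min(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
    }


# Made-up values for illustration; not taken from any pointblank dataset.
values = ["apple", "kiwi", None, "banana", "kiwi"]
summary = summarize_string_column(values)
print(summary)
```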

The table used above was a Polars DataFrame, but the col_summary_tbl() function works with any table supported by pointblank, including Pandas DataFrames and Ibis backend tables. Here’s an example using a DuckDB table handled by Ibis:

nycflights = pb.load_dataset(dataset="nycflights", tbl_type="duckdb")

pb.col_summary_tbl(data=nycflights, tbl_name="nycflights")
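Rows 336,776   Columns 18

| Type | Column | Data type | NA | UQ | Mean | SD | Min | P5 | Q1 | Med | Q3 | P95 | Max | IQR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| numeric | year | Int64 | 0 (0) | 1 (<.01) | 2,013 | 0 | 2,013 | 2,013 | 2,013 | 2,013 | 2,013 | 2,013 | 2,013 | 0 |
| numeric | month | Int64 | 0 (0) | 12 (<.01) | 6.55 | 3.41 | 1 | 1 | 4 | 6.77 | 10 | 12 | 12 | 6 |
| numeric | day | Int64 | 0 (0) | 31 (<.01) | 15.71 | 8.77 | 1 | 1 | 8 | 15.64 | 23 | 29 | 31 | 15 |
| numeric | dep_time | Int64 | 8255 (0.02) | 1319 (<.01) | 1,349.11 | 488.28 | 1 | 514 | 907 | 1,381.15 | 1,744 | 2,112 | 2,400 | 837 |
| numeric | sched_dep_time | Int64 | 0 (0) | 1021 (<.01) | 1,344.25 | 467.34 | 106 | 545 | 906 | 1,377.13 | 1,729 | 2,050 | 2,359 | 823 |
| numeric | dep_delay | Int64 | 8255 (0.02) | 528 (<.01) | 12.64 | 40.21 | −43 | −13 | −5 | −1.58 | 11 | 88 | 1,301 | 16 |
| numeric | arr_time | Int64 | 8713 (0.03) | 1412 (<.01) | 1,502.05 | 533.26 | 1 | 10 | 1,104 | 1,539.25 | 1,940 | 2,248 | 2,400 | 836 |
| numeric | sched_arr_time | Int64 | 0 (0) | 1163 (<.01) | 1,536.38 | 497.46 | 1 | 15 | 1,124 | 1,575.93 | 1,945 | 2,246 | 2,359 | 821 |
| numeric | arr_delay | Int64 | 9430 (0.03) | 578 (<.01) | 6.9 | 44.63 | −86 | −48 | −17 | −4.92 | 14 | 91 | 1,272 | 31 |
| string | carrier | String | 0 (0) | 16 (<.01) | 2 | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0 |
| numeric | flight | Int64 | 0 (0) | 3844 (0.01) | 1,971.92 | 1,632.47 | 1 | 4 | 553 | 1,499.04 | 3,465 | 4,695 | 8,500 | 2,912 |
| string | tailnum | String | 2512 (<.01) | 4044 (0.01) | 6 | 0.07 | 5 | 6 | 6 | 6 | 6 | 6 | 6 | 0 |
| string | origin | String | 0 (0) | 3 (<.01) | 3 | 0 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 0 |
| string | dest | String | 0 (0) | 105 (<.01) | 3 | 0 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 0 |
| numeric | air_time | Int64 | 9430 (0.03) | 510 (<.01) | 150.69 | 93.69 | 20 | 31 | 82 | 129.35 | 192 | 339 | 695 | 110 |
| numeric | distance | Int64 | 0 (0) | 214 (<.01) | 1,039.91 | 733.23 | 17 | 116 | 502 | 861.05 | 1,389 | 2,475 | 4,983 | 887 |
| numeric | hour | Int64 | 0 (0) | 20 (<.01) | 13.18 | 4.66 | 1 | 5 | 9 | 13.4 | 17 | 20 | 23 | 8 |
| numeric | minute | Int64 | 0 (0) | 60 (<.01) | 26.23 | 19.3 | 0 | 0 | 8 | 28.3 | 44 | 58 | 59 | 36 |
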
For string columns, the statistics describe string lengths.