gt_plt_summary

gt_plt_summary(
    df,
    title=None,
    show_desc_stats=True,
    add_mode=False,
    interactivity=True,
    new_color_mapping=None,
)

Create a comprehensive data summary table with visualizations.

The gt_plt_summary() function takes a DataFrame and generates a summary table with one row per column. Each row displays the column type, the percentage of missing data, descriptive statistics (mean, median, standard deviation), and a small plot overview appropriate for the data type: a histogram for numeric and datetime columns, and a categorical bar chart for string columns.

Inspired by the Observable team and the observablehq/SummaryTable function: https://observablehq.com/@observablehq/summary-table

Parameters

df : IntoDataFrame: A DataFrame to summarize. Can be any DataFrame type that you would pass into a GT.
title : str | None = None: Optional title for the summary table. If None, defaults to “Summary Table”.
show_desc_stats : bool = True: Whether to show the Mean, Median, and SD columns. Set to False to hide them.
add_mode : bool = False: Whether to add a Mode column.
interactivity : bool = True: Whether the graphs in the Plot Overview column are interactive. Interactivity here means the hover CSS and tooltips applied to the graphs.
new_color_mapping : dict | None = None: A dictionary that maps data types (string, numeric, datetime, boolean, and other) to their corresponding color codes in hexadecimal format.

Returns

: GT: A GT object containing the summary table with columns for Type, Column name, Plot Overview, Missing percentage, Mean, Median, and Standard Deviation.

Examples

import polars as pl
from great_tables import GT
import gt_extras as gte
from datetime import datetime

df = pl.DataFrame({
    "Date": [
        datetime(2024, 1, 1),
        datetime(2024, 1, 2),
        datetime(2024, 1, 7),
        datetime(2024, 1, 8),
        datetime(2024, 1, 13),
        datetime(2024, 1, 16),
        datetime(2024, 1, 20),
        datetime(2024, 1, 22),
        datetime(2024, 2, 1),
    ] * 5,
    "Value": [10, 15, 20, None, 25, 18, 22, 30, 40] * 5,
    "Category": ["A", "B", "C", "A", "B", "C", "D", None, None] * 5,
    "Boolean": [True, False, True] * 15,
    "Status": ["Active", "Inactive", None] * 15,
})

gte.gt_plt_summary(df)

Summary Table
45 rows x 5 cols

Type	Column	Missing	Mean	Median	SD
	Date	0.0%	—	—	—
	Value	11.1%	22.50	21.00	8.83
	Category	22.2%	—	—	—
	Boolean	0.0%	0.67	—	—
	Status	33.3%	—	—	—
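
As a cross-check on the Value row above, the same figures can be reproduced with Python's standard library alone. This is only a sketch: how gt_plt_summary() computes its statistics internally is an assumption here, and the variable names are illustrative. The sample standard deviation (n - 1 denominator) is used because it agrees with the 8.83 shown in the table.

```python
import statistics

# "Value" column from the first example, nulls included.
value = [10, 15, 20, None, 25, 18, 22, 30, 40] * 5

# Drop nulls before computing descriptive statistics.
present = [v for v in value if v is not None]
missing_pct = 100 * (len(value) - len(present)) / len(value)

summary = {
    "Missing": f"{missing_pct:.1f}%",
    "Mean": f"{statistics.mean(present):.2f}",
    "Median": f"{statistics.median(present):.2f}",
    # Sample standard deviation (n - 1 denominator); assumed, not confirmed
    # against the library's implementation.
    "SD": f"{statistics.stdev(present):.2f}",
}

print(summary)
# {'Missing': '11.1%', 'Mean': '22.50', 'Median': '21.00', 'SD': '8.83'}
```

Note that the missing percentage is taken over all rows, while the mean, median, and SD are taken over the non-null values only.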

A second example, with numeric data drawn from several distributions:

import random

n = 100
random.seed(23)

uniform = [random.uniform(0, 10) for _ in range(n)]
for i in range(2, 10):
    uniform[i] = None

normal = [random.gauss(5, 2) for _ in range(n)]
normal[4] = None
normal[10] = None

single_tailed = [random.expovariate(1/2) for _ in range(n)]

bimodal = (
    [random.gauss(2, 0.5) for _ in range(n // 2)]
    + [random.gauss(8, 0.5) for _ in range(n - n // 2)]
)

df = pl.DataFrame({
    "uniform": uniform,
    "normal": normal,
    "single_tailed": single_tailed,
    "bimodal": bimodal,
})

gte.gt_plt_summary(df)

Summary Table
100 rows x 4 cols

Type	Column	Missing	Mean	Median	SD
	uniform	8.0%	5.13	5.21	2.96
	normal	2.0%	5.29	5.11	1.91
	single_tailed	0.0%	1.93	1.52	1.85
	bimodal	0.0%	4.86	4.86	3.03

And lastly, an example showing ocean swell data with changes to the default color mapping:

import polars as pl
from great_tables import GT
import gt_extras as gte
from datetime import datetime

df = pl.DataFrame({
    "Date": [
        datetime(2024, 7, 1, 6, 0),
        datetime(2024, 7, 1, 12, 0),
        datetime(2024, 7, 2, 6, 0),
        datetime(2024, 7, 2, 12, 0),
        datetime(2024, 7, 3, 6, 0),
        datetime(2024, 7, 3, 12, 0),
        datetime(2024, 7, 4, 6, 0),
        datetime(2024, 7, 4, 12, 0),
        datetime(2024, 7, 5, 6, 0),
    ],
    "Height_m": [1.2, 1.5, 2.1, 2.4, 1.8, None, 2.7, 3.0, 2.5],
    "Period_s": [10, 12, 14, 15, 11, 9, 16, None, 13],
    "Direction_deg": [210, 215, 220, 225, 205, 200, 230, 240, 235],
    "WindSpeed_kts": [5, 7, 10, 12, 6, 4, 8, 11, None],
    "Breaking": [True, True, True, False, True, False, True, True, True],
})

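# Keys follow the data types listed for new_color_mapping:
# string, numeric, datetime, boolean, and other.
color_mapping = {
    "datetime": "blue",
    "numeric": "lightblue",
    "boolean": "lightgreen",
}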

gte.gt_plt_summary(df, new_color_mapping=color_mapping)

Summary Table
9 rows x 6 cols

Type	Column	Missing	Mean	Median	SD
	Date	0.0%	—	—	—
	Height_m	11.1%	2.15	2.25	0.62
	Period_s	11.1%	12.50	12.50	2.45
	Direction_deg	0.0%	220.00	220.00	13.69
	WindSpeed_kts	11.1%	7.88	7.50	2.90
	Breaking	0.0%	0.78	—	—
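
Because the example above overrides only some data types, it can help to picture new_color_mapping as a partial override of a default mapping. The sketch below illustrates that idea with the standard library only. Everything in it is hypothetical: merge_color_mapping, KNOWN_TYPES, and the placeholder default colors are not part of gt-extras, and whether the library merges or validates keys this way is an assumption.

```python
# Data types named in the new_color_mapping parameter description.
KNOWN_TYPES = {"string", "numeric", "datetime", "boolean", "other"}

# Placeholder defaults for illustration only; NOT the library's real palette.
HYPOTHETICAL_DEFAULTS = {
    "string": "#111111",
    "numeric": "#222222",
    "datetime": "#333333",
    "boolean": "#444444",
    "other": "#555555",
}


def merge_color_mapping(defaults, overrides=None):
    """Return defaults with overrides applied, rejecting unknown type names."""
    overrides = overrides or {}
    unknown = set(overrides) - KNOWN_TYPES
    if unknown:
        # Raising here is this sketch's choice, so that a misspelled key
        # is not silently ignored.
        raise ValueError(f"Unknown data type keys: {sorted(unknown)}")
    # Later entries win, so overrides replace the matching defaults.
    return {**defaults, **overrides}


merged = merge_color_mapping(
    HYPOTHETICAL_DEFAULTS,
    {"numeric": "lightblue", "boolean": "lightgreen"},
)
print(merged["numeric"])  # lightblue
print(merged["string"])   # #111111 (placeholder default, unchanged)
```

Validating keys up front is a design choice of this sketch; it makes a mistyped data type name fail loudly rather than leave a column with an unexpected color.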

Note

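The datatype (dtype) of each column in your DataFrame determines the type it is classified as in the summary table. Keep in mind that pandas and polars can assign different dtypes to the same data, especially when null values are present. For example, pandas by default upcasts an integer column containing None to float64, whereas polars keeps it as an integer column with nulls.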
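
To make the idea of type classification concrete, here is a minimal standard-library sketch that sorts plain Python values into the five categories used by the summary table. classify_values is a hypothetical helper written for this note; gt_plt_summary() classifies columns from their DataFrame dtype, not from Python values, so treat this as an analogy and not as the library's logic.

```python
from datetime import date, datetime


def classify_values(values):
    """Hypothetical helper: guess a summary type from Python values, ignoring None."""
    present = [v for v in values if v is not None]
    if not present:
        return "other"
    # bool must be tested before numbers because bool is a subclass of int.
    if all(isinstance(v, bool) for v in present):
        return "boolean"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "numeric"
    # datetime is a subclass of date, so this covers both.
    if all(isinstance(v, date) for v in present):
        return "datetime"
    if all(isinstance(v, str) for v in present):
        return "string"
    return "other"


print(classify_values([True, False, None]))      # boolean
print(classify_values([10, None, 2.5]))          # numeric
print(classify_values([datetime(2024, 1, 1)]))   # datetime
print(classify_values(["A", None]))              # string
print(classify_values([1, "a"]))                 # other
```

The ordering of the checks matters: testing for numbers first would label a column of True/False values as numeric, which mirrors the kind of dtype surprise the note warns about.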