gt-extras


gt_plt_summary

gt_plt_summary(df, title=None)

Create a comprehensive data summary table with visualizations.

The gt_plt_summary() function takes a DataFrame and generates a summary table with key statistics and a visual overview of each column. Each row shows the column type, the percentage of missing values, descriptive statistics (mean, median, standard deviation), and a small plot suited to the data type: a histogram for numeric and datetime columns, and a categorical bar chart for string columns.

Inspired by the Observable team and the observablehq/SummaryTable function: https://observablehq.com/@observablehq/summary-table

Parameters

df : IntoDataFrame

A DataFrame to summarize. Can be any DataFrame type that you would pass into a GT.

title : str | None = None

Optional title for the summary table. If None, defaults to “Summary Table”.

Returns

GT

A GT object containing the summary table with columns for Type, Column name, Plot Overview, Missing percentage, Mean, Median, and Standard Deviation.

Examples

import polars as pl
from great_tables import GT
import gt_extras as gte
from datetime import datetime

df = pl.DataFrame({
    "Date": [
        datetime(2024, 1, 1),
        datetime(2024, 1, 2),
        datetime(2024, 1, 7),
        datetime(2024, 1, 8),
        datetime(2024, 1, 13),
        datetime(2024, 1, 16),
        datetime(2024, 1, 20),
        datetime(2024, 1, 22),
        datetime(2024, 2, 1),
    ] * 5,
    "Value": [10, 15, 20, None, 25, 18, 22, 30, 40] * 5,
    "Category": ["A", "B", "C", "A", "B", "C", "D", None, None] * 5,
    "Boolean": [True, False, True] * 15,
    "Status": ["Active", "Inactive", None] * 15,
})

gte.gt_plt_summary(df)
Summary Table
45 rows x 5 cols
(Type shows the icon name; Plot Overview lists the plotted range and the row count per bin or category.)

Type | Column | Plot Overview | Missing | Mean | Median | SD
Clock | Date | [histogram, 2024-01-01 to 2024-02-01; bins: 15, 10, 5, 10, 5] | 0.0% | - | - | -
signal | Value | [histogram, 10 to 40; bins: 5, 10, 10, 5, 5, 5] | 11.1% | 22.50 | 21.00 | 8.83
List | Category | [bar chart; "A": 10, "B": 10, "C": 10, "D": 5] | 22.2% | - | - | -
Check | Boolean | [bar chart; "True": 30, "False": 15] | 0.0% | 0.67 | - | -
List | Status | [bar chart; "Active": 15, "Inactive": 15] | 33.3% | - | - | -
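To make the statistics columns concrete, the figures in the Value row can be reproduced by hand. This is a minimal sketch using only the standard library, not the library's own implementation. That the SD column is the sample standard deviation (n - 1 denominator) is an inference from the numbers: it is the variant that yields 8.83 for this data.

```python
import statistics

# The "Value" column from the example above: 9 entries repeated 5 times,
# one of which is missing (None).
values = [10, 15, 20, None, 25, 18, 22, 30, 40] * 5

# Statistics are computed over the non-missing values only.
present = [v for v in values if v is not None]

missing_pct = 100 * (len(values) - len(present)) / len(values)
mean = statistics.mean(present)
median = statistics.median(present)
# Sample standard deviation (n - 1 denominator); assumed here because it
# matches the value shown in the table.
sd = statistics.stdev(present)

print(f"Missing: {missing_pct:.1f}%")  # Missing: 11.1%
print(f"Mean:    {mean:.2f}")          # Mean:    22.50
print(f"Median:  {median:.2f}")        # Median:  21.00
print(f"SD:      {sd:.2f}")            # SD:      8.83
```

Five of the 45 rows are missing, which gives the 11.1% in the Missing column.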

A second example, using numeric data drawn from several distributions:

import random

n = 100
random.seed(23)

uniform = [random.uniform(0, 10) for _ in range(n)]
for i in range(2, 10):
    uniform[i] = None

normal = [random.gauss(5, 2) for _ in range(n)]
normal[4] = None
normal[10] = None

single_tailed = [random.expovariate(1/2) for _ in range(n)]

bimodal = [random.gauss(2, 0.5) for _ in range(n // 2)] + [random.gauss(8, 0.5) for _ in range(n - n // 2)]

df = pl.DataFrame({
    "uniform": uniform,
    "normal": normal,
    "single_tailed": single_tailed,
    "bimodal": bimodal,
})

gte.gt_plt_summary(df)
Summary Table
100 rows x 4 cols
(Type shows the icon name; Plot Overview lists the plotted range and the row count per bin.)

Type | Column | Plot Overview | Missing | Mean | Median | SD
signal | uniform | [histogram, 0.23 to 9.88; bins: 20, 17, 19, 14, 22] | 8.0% | 5.13 | 5.21 | 2.96
signal | normal | [histogram, 1.47 to 9.8; bins: 7, 18, 24, 18, 15, 12, 4] | 2.0% | 5.29 | 5.11 | 1.91
signal | single_tailed | [histogram, 0.04 to 8.08; bins: 29, 20, 24, 9, 7, 3, 1, 1, 0, 1, 5] | 0.0% | 1.93 | 1.52 | 1.85
signal | bimodal | [histogram, 0.6 to 9.06; bins: 47, 3, 3, 47] | 0.0% | 4.86 | 4.86 | 3.03

Note

The data type (dtype) of each column in your DataFrame determines the type it is classified as in the summary table. Keep in mind that pandas and polars can assign different dtypes to the same data, especially when null values are present.
