Great Tables is now BYODF (Bring Your Own DataFrame)

Author

Michael Chow

Published

April 24, 2024

A few months ago, we released a blog post about how much we loved the combination of Polars and Great Tables. We found that Polars' lazy expression system opened up convenient ways to conditionally format tables for presentation. However, excited as we were, we were harboring a shameful secret: Great Tables supported Polars as an optional dependency, but had a hard dependency on the alternative DataFrame library Pandas.

We’re happy to share that Great Tables v0.5.0 makes Pandas an optional dependency. Using Pandas DataFrames as inputs is still fully supported. The optional dependency simply allows users of one DataFrame library to not have to install the other.

In this post, I’ll cover three important pieces:

  • Where we are today on dependencies
  • The challenge of removing hard dependencies
  • How we made Pandas optional

This may seem over the top, but many DataFrame implementations exist in the Python world. Enabling folks to BYODF (Bring Your Own DataFrame) is a tricky, rewarding challenge!

The state of Great Tables dependencies

Currently, Great Tables has two “sizes” of libraries it depends on:

  • Small: utility libraries for things like datetime localization.
  • Big: a lingering dependency on numpy in a few places (like nanoplots).

For small utilities, we depend on Babel, which makes it easier to say things like, “format this number as if I’m in Germany”.

For big dependencies, numpy should be fairly straightforward to remove (see this issue). We also still rely on Pandas for datasets in great_tables.data, but we will remove it soon (see this issue).

Removing dependencies like numpy and Pandas helps people who are in restricted computing environments, who want a more lightweight install, or who are stuck on a much earlier version of a package. It also helps us keep a clean separation of concerns. Without clear boundaries, it's too tempting to reach for things like pd.isna() in a pinch, or to smear library-specific versions of missingness across our code (e.g. pd.NA, np.nan, Polars' null).
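To make the missingness problem concrete, here is a small illustration (not Great Tables code) of how the missing-value sentinels diverge across libraries:

```python
import math

import numpy as np
import pandas as pd

# Each library spells "missing" differently:
print(pd.isna(pd.NA))      # True: pandas recognizes its own NA scalar
print(pd.isna(np.nan))     # True: ...and also numpy's NaN
print(math.isnan(np.nan))  # True: the stdlib check for float NaN

# But the sentinels themselves are not interchangeable:
print(np.nan == np.nan)    # False: NaN never equals itself
print(pd.NA is np.nan)     # False: entirely distinct objects
```

Code that mixes these checks freely becomes hard to reason about, which is why keeping them behind a clear boundary matters.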

The challenge of removing hard dependencies

Removing hard dependencies on DataFrame libraries is worthwhile, but it requires special handling for every DataFrame-specific action. To illustrate, consider the Great Tables output below, which is produced from a Pandas DataFrame:

import pandas as pd
import polars as pl
from great_tables import GT

df_pandas = pd.DataFrame({"x": ["a", "b"], "y": [1.01, 2.0]})
df_polars = pl.from_pandas(df_pandas)

GT(df_pandas)
x     y
a  1.01
b   2.0

Producing this table includes two actions on the DataFrame:

  • Get column names: these are used for column labels (and other things).
  • Get column types: these are used for alignment (e.g. numeric column is right aligned).

While these actions may seem simple, they require different methods for different DataFrame implementations. In this post, we’ll focus specifically on the challenge of getting column names.

Getting column names

The code below shows the different methods required to get column names as a list from Pandas and Polars.

df_pandas.columns.tolist()  # pandas
df_polars.columns           # polars
['x', 'y']

Notice that the two lines of code aren't too different: Pandas just requires an extra .tolist() call. We could write a single function that returns a list of names, branching on the type of the input DataFrame.

def get_column_names(data) -> list[str]:

    # pandas specific ----
    if isinstance(data, pd.DataFrame):
        return data.columns.tolist()

    # polars specific ----
    elif isinstance(data, pl.DataFrame):
        return data.columns

    raise TypeError(f"Unsupported type {type(data)}")

The function works great, in that we can call it on either DataFrame, but it lacks two things. The first is dependency inversion: it requires importing both Pandas and Polars, creating a hard dependency on each. The second is separation of concerns: Pandas and Polars code are mixed together, so adding more DataFrame implementations would create a hot stew of branching logic.

How we made Pandas optional

We were able to make Pandas optional in a sane manner through two moves:

  • databackend: perform isinstance checks without importing anything.
  • singledispatch: split out functions like get_column_names() into DataFrame specific versions.

Inverting dependency with databackend

Inverting our dependency on DataFrame libraries means checking whether something is a specific type of DataFrame without importing that library. This is done through the package databackend, which we vendored (copied) into Great Tables.

It works by creating placeholder classes, which stand in for the DataFrames they’re detecting:

from great_tables._databackend import AbstractBackend


class PdDataFrame(AbstractBackend):
    _backends = [("pandas", "DataFrame")]


class PlDataFrame(AbstractBackend):
    _backends = [("polars", "DataFrame")]


if isinstance(df_pandas, PdDataFrame):
    print("I'm a pandas DataFrame!!!")
I'm a pandas DataFrame!!!

Note that PdDataFrame above is able to detect a Pandas DataFrame without importing Pandas, by taking advantage of a simple piece of logic (the contrapositive):

  • assumption: if df_pandas is a Pandas DataFrame, then Pandas has been imported.
  • contrapositive: if Pandas has not been imported, then df_pandas is not a Pandas DataFrame.

This lets it quickly rule out a potential Pandas object by checking whether Pandas has been imported. Since this can be done by looking inside sys.modules, no imports are required. For more on this approach, see the databackend README.
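The core trick can be sketched in a few lines. This is a simplified version of the idea, not databackend's actual implementation, and the helper name is made up for illustration:

```python
import sys


def is_pandas_df(obj) -> bool:
    """Return True if obj is a pandas DataFrame, without importing pandas.

    If "pandas" is absent from sys.modules, it was never imported,
    so obj cannot possibly be a pandas DataFrame.
    """
    pd = sys.modules.get("pandas")
    if pd is None:
        return False
    return isinstance(obj, pd.DataFrame)


print(is_pandas_df([1, 2, 3]))  # False, even if pandas isn't installed
```

The check is cheap in the common negative case: a dictionary lookup in sys.modules, with isinstance only reached once the library is known to be loaded.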

Separating concerns with singledispatch

While databackend removes the import dependency, singledispatch from the built-in functools module separates the logic for handling Polars DataFrames from the logic for handling Pandas DataFrames. This makes it easier to reason about one DataFrame at a time, and also gets us better type hinting.

Here’s a basic example, showing the get_column_names() function re-written using singledispatch:

from functools import singledispatch


# define the generic function ----
#
@singledispatch
def get_column_names(data) -> list[str]:
    raise TypeError(f"Unsupported type {type(data)}")


# register a pandas implementation on it ----
#
@get_column_names.register
def _(data: PdDataFrame):
    return data.columns.tolist()


# register a polars implementation on it ----
#
@get_column_names.register
def _(data: PlDataFrame):
    return data.columns

Note three important pieces:

  • The initial @singledispatch decorates def get_column_names(...). This creates a special “generic function”, onto which DataFrame-specific implementations can be registered.
  • @get_column_names.register registers the Pandas implementation.
  • The PdDataFrame annotation is what signifies “run this implementation for Pandas DataFrames”.

With the get_column_names implementations defined, we can call it like a normal function:

get_column_names(df_pandas)  # pandas version
get_column_names(df_polars)  # polars version
['x', 'y']

For more on the benefits of singledispatch in data tooling, see the blog post Single Dispatch for Data Science Tools. For the nitty gritty on our DataFrame processing, see the Great Tables _tbl_data.py submodule.

See you in the Polarsverse

This was a long diversion into the strategy behind supporting both Pandas and Polars, but the results are worth it. Users are able to bring their DataFrame of choice without the collective baggage of every DataFrame option.

For more on the special things you can do with Polars expressions, see these resources:

  • Guide: basic styling using Polars expressions
  • Post: Great Tables, the Polars DataFrame Styler of Your Dreams
  • The narwhals library: a neat tool for running Polars expressions on Pandas DataFrames.

Hope you make some stylish, publication-ready tables!