Michael Chow
April 24, 2024
A few months ago, we released a blog post about how much we loved the combination of Polars and Great Tables. We found that Polars’ lazy expression system opened up convenient ways to conditionally format tables for presentation. However, excited as we were, we were harboring a shameful secret: Great Tables enabled Polars as an optional dependency, but had a hard dependency on the alternative DataFrame library Pandas.
We’re happy to share that Great Tables v0.5.0 makes Pandas an optional dependency. Using Pandas DataFrames as inputs is still fully supported. The optional dependency simply allows users of one DataFrame library to not have to install the other.
In this post, I’ll cover three important pieces:

- Why removing hard DataFrame dependencies is worth the trouble.
- How we detect DataFrames without importing them, using databackend.
- How we split DataFrame-specific logic apart, using singledispatch.
This may seem over the top, but many DataFrame implementations exist in the Python world. Enabling folks to BYODF (Bring Your Own DataFrame) is a tricky, rewarding challenge!
Currently, Great Tables has two “sizes” of libraries it depends on:
- For small utilities, we depend on Babel, which makes it easier to say things like, “format this number as if I’m in Germany”.
- For big dependencies, numpy should be fairly straightforward to remove (see this issue). We also still rely on Pandas for datasets in great_tables.data, but we will remove it soon (see this issue).
Removing dependencies like numpy and Pandas helps people who are in restricted computing environments, want a more lightweight install, or who are stuck depending on a much earlier version of a package. It also helps us keep a clean separation of concerns. Without clear boundaries, it’s too tempting to reach for things like pd.isna() in a pinch, or smear library-specific versions of missingness across our code (e.g. pd.NA, np.nan, polars.NullType).
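As a quick illustration (this snippet is not from the Great Tables codebase), each library spells missingness differently:

import numpy as np
import pandas as pd
import polars as pl

pd.isna(pd.NA)                  # True: pandas' scalar missing value
np.isnan(np.nan)                # True: NumPy's floating-point NaN
pl.Series([None, 1]).is_null()  # polars tracks nulls per Series: [true, false]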
Removing hard dependencies on DataFrame libraries is worthwhile, but requires special handling for all DataFrame-specific actions. To illustrate, consider the Great Tables output below, which is produced from a Pandas DataFrame:
import pandas as pd
import polars as pl
from great_tables import GT
df_pandas = pd.DataFrame({"x": ["a", "b"], "y": [1.01, 2.0]})
df_polars = pl.from_pandas(df_pandas)
GT(df_pandas)
| x | y    |
|---|------|
| a | 1.01 |
| b | 2.0  |
Producing this table includes two actions on the DataFrame:

- Getting the column names (x and y).
- Getting the values out of each column.
While these actions may seem simple, they require different methods for different DataFrame implementations. In this post, we’ll focus specifically on the challenge of getting column names.
The code below shows the different methods required to get column names as a list from Pandas and Polars.
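df_pandas.columns.tolist()  # pandas: .columns is an Index, so we convert it to a list
df_polars.columns           # polars: .columns is already a list of strings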
Notice that the two lines of code aren’t too different: Pandas just requires an extra .tolist() piece. We could create a special function that returns a list of names, depending on the type of the input DataFrame.
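A rough sketch of such a function (note that it has to import both libraries up front):

import pandas as pd
import polars as pl

def get_column_names(data):
    # Branch on the concrete DataFrame type.
    if isinstance(data, pd.DataFrame):
        return data.columns.tolist()
    elif isinstance(data, pl.DataFrame):
        return data.columns

    raise TypeError(f"Unsupported type {type(data)}")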
The function works great, in that we can call it on either DataFrame, but it lacks two things. The first is dependency inversion, since it requires importing both Pandas and Polars (creating a hard dependency on both). The second is separation of concerns, since Pandas and Polars code is mixed together. In this case, adding more DataFrame implementations would create a hot stew of logic.
We were able to make Pandas optional in a sane manner through two moves:

- Performing isinstance checks without importing anything, using databackend.
- Splitting get_column_names() into DataFrame-specific versions, using singledispatch.

Inverting dependency on DataFrame libraries means that we check whether something is a specific type of DataFrame, without using imports. This is done through the package databackend, which we copied into Great Tables.
It works by creating placeholder classes, which stand in for the DataFrames they’re detecting:
from great_tables._databackend import AbstractBackend

class PdDataFrame(AbstractBackend):
    _backends = [("pandas", "DataFrame")]

class PlDataFrame(AbstractBackend):
    _backends = [("polars", "DataFrame")]

if isinstance(df_pandas, PdDataFrame):
    print("I'm a pandas DataFrame!!!")

I'm a pandas DataFrame!!!
Note that the PdDataFrame above is able to detect a Pandas DataFrame without importing Pandas, by taking advantage of a bit of logic called a counterfactual:

- If df_pandas is a Pandas DataFrame, then Pandas has been imported.
- If Pandas has not been imported, then df_pandas is not a Pandas DataFrame.

This lets it quickly rule out a potential Pandas object by checking whether Pandas has been imported. Since this can be done by looking inside sys.modules, no imports are required. For more on this approach, see the databackend README.
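Here is a minimal sketch of that check (the helper below is hypothetical, not databackend’s actual implementation):

import sys

def maybe_pandas_df(obj) -> bool:
    # Hypothetical helper: if pandas was never imported, obj cannot
    # possibly be a pandas DataFrame, so we can answer without importing it.
    pd = sys.modules.get("pandas")
    if pd is None:
        return False
    return isinstance(obj, pd.DataFrame)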
While databackend removes dependencies, the use of singledispatch from the built-in functools module separates out the logic for handling Polars DataFrames from the logic for Pandas DataFrames. This makes it easier to think about one DataFrame at a time, and also gets us better type hinting.
Here’s a basic example, showing the get_column_names() function re-written using singledispatch:
from functools import singledispatch

# define the generic function ----
@singledispatch
def get_column_names(data) -> list[str]:
    raise TypeError(f"Unsupported type {type(data)}")

# register a pandas implementation on it ----
@get_column_names.register
def _(data: PdDataFrame):
    return data.columns.tolist()

# register a polars implementation on it ----
@get_column_names.register
def _(data: PlDataFrame):
    return data.columns
Note three important pieces:
- @singledispatch decorates def get_column_names(...). This creates a special “generic function”, which can define DataFrame-specific implementations.
- @get_column_names.register implements the Pandas DataFrame version.
- The type annotation PdDataFrame is what signifies “run this for Pandas DataFrames”.

With the get_column_names implementations defined, we can call it like a normal function:
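get_column_names(df_pandas)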
['x', 'y']
For more on the benefits of singledispatch in data tooling, see the blog post Single Dispatch for Data Science Tools. For the nitty gritty on our DataFrame processing, see the Great Tables _tbl_data.py submodule.
This was a long diversion into the strategy behind supporting both Pandas and Polars, but the results are worth it. Users are able to bring their DataFrame of choice without the collective baggage of every DataFrame option.
For more on the special things you can do with Polars expressions, see these resources:
Hope you make some stylish, publication-ready tables!