Getting Started
Install OrbitalML:
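pip install orbitalml  # assumes the package is published on PyPI under its import name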
Prepare some data:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
COLUMNS = ["sepal.length", "sepal.width", "petal.length", "petal.width"]
iris = load_iris(as_frame=True)
iris_x = iris.data.set_axis(COLUMNS, axis=1)
# SQL and OrbitalML don't like dots in column names, so replace them with underscores
iris_x.columns = COLUMNS = [cname.replace(".", "_") for cname in COLUMNS]
X_train, X_test, y_train, y_test = train_test_split(
    iris_x, iris.target, test_size=0.2, random_state=42
)
Define a Scikit-Learn pipeline and train it:
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipeline = Pipeline(
    [
        (
            "preprocess",
            ColumnTransformer(
                [("scaler", StandardScaler(with_std=False), COLUMNS)],
                remainder="passthrough",
            ),
        ),
        ("linear_regression", LinearRegression()),
    ]
)
pipeline.fit(X_train, y_train)
Convert the pipeline to OrbitalML:
import orbitalml
import orbitalml.types
orbitalml_pipeline = orbitalml.parse_pipeline(pipeline, features={
    "sepal_length": orbitalml.types.DoubleColumnType(),
    "sepal_width": orbitalml.types.DoubleColumnType(),
    "petal_length": orbitalml.types.DoubleColumnType(),
    "petal_width": orbitalml.types.DoubleColumnType(),
})
You can print the pipeline to see the result:
>>> print(orbitalml_pipeline)
ParsedPipeline(
  features={
    sepal_length: DoubleColumnType()
    sepal_width: DoubleColumnType()
    petal_length: DoubleColumnType()
    petal_width: DoubleColumnType()
  },
  steps=[
    merged_columns=Concat(
      inputs: sepal_length, sepal_width, petal_length, petal_width,
      attributes:
        axis=1
    )
    variable1=Sub(
      inputs: merged_columns, Su_Subcst=[5.809166666666666, 3.0616666666666665, 3.7266666666666666, 1.18333333...,
      attributes:
    )
    multiplied=MatMul(
      inputs: variable1, coef=[-0.11633479416518255, -0.05977785171980231, 0.25491374699772246, 0.5475959...,
      attributes:
    )
    resh=Add(
      inputs: multiplied, intercept=[0.9916666666666668],
      attributes:
    )
    variable=Reshape(
      inputs: resh, shape_tensor=[-1, 1],
      attributes:
    )
  ],
)
Now we can generate the SQL from the pipeline:
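# export_sql takes the source table name, the parsed pipeline, and a target SQL
# dialect; here we target DuckDB, which is used to run the query below
sql = orbitalml.export_sql("DATA_TABLE", orbitalml_pipeline, dialect="duckdb")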
And check the resulting query:
>>> print(sql)
SELECT ("t0"."sepal_length" - 5.809166666666666) * -0.11633479416518255 + 0.9916666666666668 +
("t0"."sepal_width" - 3.0616666666666665) * -0.05977785171980231 +
("t0"."petal_length" - 3.7266666666666666) * 0.25491374699772246 +
("t0"."petal_width" - 1.1833333333333333) * 0.5475959809777828
AS "variable" FROM "DATA_TABLE" AS "t0"
Once the SQL is generated, you can use it to run the pipeline directly on a database. From here the SQL can be exported and reused in other environments:
>>> import duckdb
>>> print("\nPrediction with SQL")
>>> duckdb.register("DATA_TABLE", X_test)
>>> print(duckdb.sql(sql).df()["variable"][:5].to_numpy())
Prediction with SQL
[ 1.23071715 -0.04010441 2.21970287 1.34966889 1.28429336]
We can verify that the prediction matches the one produced by Scikit-Learn by running the Scikit-Learn pipeline on the same data:
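>>> print("\nPrediction with Scikit-Learn")
>>> print(pipeline.predict(X_test)[:5])

The first five predictions should match the SQL output shown above.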