ydot

ydot is a Python API to produce PySpark dataframe models from R-like formula expressions. This project is based on patsy [pat]. As a quickstart, let’s say you have a Spark dataframe with data as follows.

Dummy Data in a Spark Dataframe

a      b     x1                  x2                  y
left   low   19.945536387662504  3.85214120038979    0.0
left   low   20.674308066353493  4.098585619118175   1.0
right  high  20.346647025958433  2.7107604387194626  1.0
right  mid   18.699653829045985  5.2111542692543065  1.0
left   low   21.51851187887476   2.432390426907621   1.0
right  mid   20.989823705535017  3.6774523253171734  1.0
right  high  20.277680897136328  2.4873300559969604  0.0
right  mid   19.551410645704927  2.3549674965407372  0.0
right  low   20.96196624352397   3.1665930443154995  0.0
right  mid   19.172421360793678  3.562224297579924   1.0
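For readers who want to reproduce the quickstart, the dummy data above can be loaded into a Spark dataframe roughly as follows. This is a minimal sketch, not part of ydot: the names COLUMNS, ROWS and make_sdf are hypothetical, and the local SparkSession settings are assumptions chosen for a laptop-sized test.

```python
# Dummy records copied from the table above, in the order (a, b, x1, x2, y).
COLUMNS = ["a", "b", "x1", "x2", "y"]
ROWS = [
    ("left", "low", 19.945536387662504, 3.85214120038979, 0.0),
    ("left", "low", 20.674308066353493, 4.098585619118175, 1.0),
    ("right", "high", 20.346647025958433, 2.7107604387194626, 1.0),
    ("right", "mid", 18.699653829045985, 5.2111542692543065, 1.0),
    ("left", "low", 21.51851187887476, 2.432390426907621, 1.0),
    ("right", "mid", 20.989823705535017, 3.6774523253171734, 1.0),
    ("right", "high", 20.277680897136328, 2.4873300559969604, 0.0),
    ("right", "mid", 19.551410645704927, 2.3549674965407372, 0.0),
    ("right", "low", 20.96196624352397, 3.1665930443154995, 0.0),
    ("right", "mid", 19.172421360793678, 3.562224297579924, 1.0),
]


def make_sdf(spark=None):
    """Return a Spark dataframe built from ROWS (hypothetical helper).

    If no SparkSession is passed in, a local one is created. PySpark is
    imported lazily so the data above can be inspected without Spark.
    """
    if spark is None:
        from pyspark.sql import SparkSession

        spark = (
            SparkSession.builder
            .master("local[1]")            # assumption: single local core
            .appName("ydot-quickstart")    # assumption: arbitrary app name
            .getOrCreate()
        )
    # A list of tuples plus a list of column names lets Spark infer the types.
    return spark.createDataFrame(ROWS, schema=COLUMNS)
```

With PySpark installed, sdf = make_sdf() gives a dataframe with two string columns (a, b) and three double columns (x1, x2, y).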

Now, let’s say you want to model this dataset as follows.

  • y ~ x1 + x2 + a + b

Then all you have to do is use the smatrices() function.

from ydot.spark import smatrices

# sdf is the Spark dataframe holding the dummy data shown above
formula = 'y ~ x1 + x2 + a + b'
y, X = smatrices(formula, sdf)

Observe that y and X will be Spark dataframes as specified by the formula. Here’s a more interesting example where you want a model specified up to all two-way interactions.

  • y ~ (x1 + x2 + a + b)**2

Then you could issue the code as below.

from ydot.spark import smatrices

formula = 'y ~ (x1 + x2 + a + b)**2'
y, X = smatrices(formula, sdf)

Your resulting X Spark dataframe will look like the following.

Dummy Data Transformed by Formula

Intercept  a[T.right]  b[T.low]  b[T.mid]  a[T.right]:b[T.low]  a[T.right]:b[T.mid]  x1                  x1:a[T.right]       x1:b[T.low]         x1:b[T.mid]         x2                  x2:a[T.right]       x2:b[T.low]         x2:b[T.mid]         x1:x2
1.0        0.0         1.0       0.0       0.0                  0.0                  19.945536387662504  0.0                 19.945536387662504  0.0                 3.85214120038979    0.0                 3.85214120038979    0.0                 76.83302248278848
1.0        0.0         1.0       0.0       0.0                  0.0                  20.674308066353493  0.0                 20.674308066353493  0.0                 4.098585619118175   0.0                 4.098585619118175   0.0                 84.73542172597531
1.0        1.0         0.0       0.0       0.0                  0.0                  20.346647025958433  20.346647025958433  0.0                 0.0                 2.7107604387194626  2.7107604387194626  0.0                 0.0                 55.154885818557126
1.0        1.0         0.0       1.0       0.0                  1.0                  18.699653829045985  18.699653829045985  0.0                 18.699653829045985  5.2111542692543065  5.2111542692543065  0.0                 5.2111542692543065  97.44678088481062
1.0        0.0         1.0       0.0       0.0                  0.0                  21.51851187887476   0.0                 21.51851187887476   0.0                 2.432390426907621   0.0                 2.432390426907621   0.0                 52.341422295472896
1.0        1.0         0.0       1.0       0.0                  1.0                  20.989823705535017  20.989823705535017  0.0                 20.989823705535017  3.6774523253171734  3.6774523253171734  0.0                 3.6774523253171734  77.18907599391727
1.0        1.0         0.0       0.0       0.0                  0.0                  20.277680897136328  20.277680897136328  0.0                 0.0                 2.4873300559969604  2.4873300559969604  0.0                 0.0                 50.437285161362595
1.0        1.0         0.0       1.0       0.0                  1.0                  19.551410645704927  19.551410645704927  0.0                 19.551410645704927  2.3549674965407372  2.3549674965407372  0.0                 2.3549674965407372  46.04293658215565
1.0        1.0         1.0       0.0       1.0                  0.0                  20.96196624352397   20.96196624352397   20.96196624352397   0.0                 3.1665930443154995  3.1665930443154995  3.1665930443154995  0.0                 66.3780165019193
1.0        1.0         0.0       1.0       0.0                  1.0                  19.172421360793678  19.172421360793678  0.0                 19.172421360793678  3.562224297579924   3.562224297579924   0.0                 3.562224297579924   68.29646521485958
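To make the column names in the transformed table easier to read, here is a plain-Python sketch of how one input record expands under y ~ (x1 + x2 + a + b)**2. It is an illustration only, not ydot's or patsy's implementation: the names LEVELS, NUMERIC, term_columns and design_row are hypothetical, the reference levels ("left" for a, "high" for b) are inferred from the column names in the table, and the column order differs from the one patsy produces.

```python
from itertools import combinations

# Non-reference levels of each categorical column. "left" (for a) and
# "high" (for b) are assumed to be the reference levels, as the column
# names a[T.right], b[T.low] and b[T.mid] in the table suggest.
LEVELS = {"a": ["right"], "b": ["low", "mid"]}
NUMERIC = ["x1", "x2"]


def term_columns(term, record):
    """Return {column_name: value} for a single main-effect term."""
    if term in NUMERIC:
        return {term: float(record[term])}
    # Treatment (dummy) coding: one 0/1 indicator per non-reference level.
    return {
        f"{term}[T.{level}]": float(record[term] == level)
        for level in LEVELS[term]
    }


def design_row(record, terms=("x1", "x2", "a", "b")):
    """Expand one record into intercept, main effects and two-way interactions."""
    row = {"Intercept": 1.0}
    for term in terms:
        row.update(term_columns(term, record))
    # Each two-way interaction column is the product of one column from
    # each of the two terms involved.
    for t1, t2 in combinations(terms, 2):
        for n1, v1 in term_columns(t1, record).items():
            for n2, v2 in term_columns(t2, record).items():
                row[f"{n1}:{n2}"] = v1 * v2
    return row


first = {"a": "left", "b": "low", "x1": 19.945536387662504, "x2": 3.85214120038979}
row = design_row(first)
for name, value in row.items():
    print(name, value)
```

For the first record this yields 15 columns, and the values agree with the first row of the table up to floating-point rounding: the indicator b[T.low] is 1.0, so x1:b[T.low] carries the x1 value, while every column involving a[T.right] is 0.0.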

In general, what you get with patsy is what you get with ydot, but there are exceptions. For example, patsy's built-in functions such as standardize() and center() will not work against Spark dataframes. Additionally, patsy allows custom transforms, but such transforms (or user-defined functions) must be visible. For now, only numpy-based transforms are allowed against continuous variables (numeric columns).
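Since the built-in transforms are not available, it may help to see what they compute. The following plain-Python sketch shows the arithmetic behind centering and standardizing a numeric column. It is not patsy's code, the function bodies are written from scratch, and the use of the population standard deviation (ddof=0) is an assumption meant to mirror patsy's default.

```python
from statistics import fmean, pstdev


def center(values):
    """Subtract the column mean from every value."""
    mu = fmean(values)
    return [v - mu for v in values]


def standardize(values):
    """Center the values, then divide by the population standard deviation."""
    sigma = pstdev(values)  # assumption: ddof=0
    return [c / sigma for c in center(values)]


# First three x1 values from the dummy data.
x1 = [19.945536387662504, 20.674308066353493, 20.346647025958433]
print(center(x1))
print(standardize(x1))
```

One possible workaround, stated here as an assumption rather than a documented ydot feature, is to compute such columns on the Spark dataframe first and then refer to the precomputed column by name in the formula.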

About

One-Off Coder is an educational, service and product company. Please visit us online to discover how we may help you achieve life-long success in your personal coding career or with your company’s business goals and objectives.

Citation

@misc{oneoffcoder_ydot_2020,
title={ydot, R-like formulas for Spark Dataframes},
url={https://github.com/oneoffcoder/pyspark-formula},
author={Jee Vang},
year={2020},
month={Dec}}

Author

Jee Vang, Ph.D.
