ydot

ydot is a Python API that produces design matrices (as PySpark dataframes) from R-like formula expressions. This project is based on patsy [pat]. As a quick start, suppose you have a Spark dataframe with the following data.

Dummy Data in a Spark Dataframe

| a     | b    | x1                 | x2                 | y   |
|-------|------|--------------------|--------------------|-----|
| left  | low  | 19.945536387662504 | 3.85214120038979   | 0.0 |
| left  | low  | 20.674308066353493 | 4.098585619118175  | 1.0 |
| right | high | 20.346647025958433 | 2.7107604387194626 | 1.0 |
| right | mid  | 18.699653829045985 | 5.2111542692543065 | 1.0 |
| left  | low  | 21.51851187887476  | 2.432390426907621  | 1.0 |
| right | mid  | 20.989823705535017 | 3.6774523253171734 | 1.0 |
| right | high | 20.277680897136328 | 2.4873300559969604 | 0.0 |
| right | mid  | 19.551410645704927 | 2.3549674965407372 | 0.0 |
| right | low  | 20.96196624352397  | 3.1665930443154995 | 0.0 |
| right | mid  | 19.172421360793678 | 3.562224297579924  | 1.0 |

Now, let’s say you want to model this dataset as follows.

  • y ~ x1 + x2 + a + b

Then all you have to do is use the smatrices() function.

from ydot.spark import smatrices

formula = 'y ~ x1 + x2 + a + b'
y, X = smatrices(formula, sdf)

Observe that y and X will be Spark dataframes as specified by the formula. Here’s a more interesting example where you want a model specified up to all two-way interactions.

  • y ~ (x1 + x2 + a + b)**2

Then you could issue the code as below.

from ydot.spark import smatrices

formula = 'y ~ (x1 + x2 + a + b)**2'
y, X = smatrices(formula, sdf)

Your resulting X Spark dataframe will look like the following.

Dummy Data Transformed by Formula

| Intercept | a[T.right] | b[T.low] | b[T.mid] | a[T.right]:b[T.low] | a[T.right]:b[T.mid] | x1 | x1:a[T.right] | x1:b[T.low] | x1:b[T.mid] | x2 | x2:a[T.right] | x2:b[T.low] | x2:b[T.mid] | x1:x2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 19.945536387662504 | 0.0 | 19.945536387662504 | 0.0 | 3.85214120038979 | 0.0 | 3.85214120038979 | 0.0 | 76.83302248278848 |
| 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 20.674308066353493 | 0.0 | 20.674308066353493 | 0.0 | 4.098585619118175 | 0.0 | 4.098585619118175 | 0.0 | 84.73542172597531 |
| 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20.346647025958433 | 20.346647025958433 | 0.0 | 0.0 | 2.7107604387194626 | 2.7107604387194626 | 0.0 | 0.0 | 55.154885818557126 |
| 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 18.699653829045985 | 18.699653829045985 | 0.0 | 18.699653829045985 | 5.2111542692543065 | 5.2111542692543065 | 0.0 | 5.2111542692543065 | 97.44678088481062 |
| 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 21.51851187887476 | 0.0 | 21.51851187887476 | 0.0 | 2.432390426907621 | 0.0 | 2.432390426907621 | 0.0 | 52.341422295472896 |
| 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 20.989823705535017 | 20.989823705535017 | 0.0 | 20.989823705535017 | 3.6774523253171734 | 3.6774523253171734 | 0.0 | 3.6774523253171734 | 77.18907599391727 |
| 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20.277680897136328 | 20.277680897136328 | 0.0 | 0.0 | 2.4873300559969604 | 2.4873300559969604 | 0.0 | 0.0 | 50.437285161362595 |
| 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 19.551410645704927 | 19.551410645704927 | 0.0 | 19.551410645704927 | 2.3549674965407372 | 2.3549674965407372 | 0.0 | 2.3549674965407372 | 46.04293658215565 |
| 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 20.96196624352397 | 20.96196624352397 | 20.96196624352397 | 0.0 | 3.1665930443154995 | 3.1665930443154995 | 3.1665930443154995 | 0.0 | 66.3780165019193 |
| 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 19.172421360793678 | 19.172421360793678 | 0.0 | 19.172421360793678 | 3.562224297579924 | 3.562224297579924 | 0.0 | 3.562224297579924 | 68.29646521485958 |

In general, what you get with patsy is what you get with ydot; however, there are exceptions. For example, patsy builtins such as standardize() and center() will not work against Spark dataframes. Additionally, patsy allows custom transforms, but such transforms (or user-defined functions) must be visible in scope. For now, only numpy-based transforms are allowed against continuous variables (or numeric columns).

Quickstart

Basic

The best way to learn R-style formula syntax with ydot is to head on over to patsy [pat] and read the documentation. Below, we show very simple code to transform a Spark dataframe into two design matrices (these are also Spark dataframes), y and X, using a formula that defines a model up to two-way interactions.

import random
from random import choice

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

from ydot.spark import smatrices

random.seed(37)
np.random.seed(37)


def get_spark_dataframe(spark):
    n = 100
    data = {
        'a': [choice(['left', 'right']) for _ in range(n)],
        'b': [choice(['high', 'mid', 'low']) for _ in range(n)],
        'x1': np.random.normal(20, 1, n),
        'x2': np.random.normal(3, 1, n),
        'y': [choice([1.0, 0.0]) for _ in range(n)]
    }
    pdf = pd.DataFrame(data)

    sdf = spark.createDataFrame(pdf)
    return sdf


if __name__ == '__main__':
    try:
        spark = (SparkSession.builder
                 .master('local[4]')
                 .appName('local-testing-pyspark')
                 .getOrCreate())
        sdf = get_spark_dataframe(spark)

        y, X = smatrices('y ~ (x1 + x2 + a + b)**2', sdf)
        y = y.toPandas()
        X = X.toPandas()

        print(X.head(10))
        X.head(10).to_csv('two-way-interactions.csv', index=False)
    except Exception as e:
        print(e)
    finally:
        try:
            spark.stop()
            print('closed spark')
        except Exception as e:
            print(e)

More

We use the code below to generate the model matrices (data) shown in the rest of this section.

import random
from random import choice

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

from ydot.spark import smatrices

random.seed(37)
np.random.seed(37)


def get_spark_dataframe(spark):
    n = 100
    data = {
        'a': [choice(['left', 'right']) for _ in range(n)],
        'b': [choice(['high', 'mid', 'low']) for _ in range(n)],
        'x1': np.random.normal(20, 1, n),
        'x2': np.random.normal(3, 1, n),
        'y': [choice([1.0, 0.0]) for _ in range(n)]
    }
    pdf = pd.DataFrame(data)

    sdf = spark.createDataFrame(pdf)
    return sdf


if __name__ == '__main__':
    try:
        spark = (SparkSession.builder
                 .master('local[4]')
                 .appName('local-testing-pyspark')
                 .getOrCreate())
        sdf = get_spark_dataframe(spark)

        formulas = [
            {
                'f': 'y ~ np.sin(x1) + np.cos(x2) + a + b',
                'o': 'transformed-continuous.csv'
            },
            {
                'f': 'y ~ x1*x2',
                'o': 'star-con-interaction.csv'
            },
            {
                'f': 'y ~ a*b',
                'o': 'star-cat-interaction.csv'
            },
            {
                'f': 'y ~ x1:x2',
                'o': 'colon-con-interaction.csv'
            },
            {
                'f': 'y ~ a:b',
                'o': 'colon-cat-interaction.csv'
            },
            {
                'f': 'y ~ (x1 + x2) / (a + b)',
                'o': 'divide-interaction.csv'
            },
            {
                'f': 'y ~ x1 + x2 + a - 1',
                'o': 'no-intercept.csv'
            }
        ]

        for item in formulas:
            f = item['f']
            o = item['o']

            y, X = smatrices(f, sdf)
            y = y.toPandas()
            X = X.toPandas()

            X.head(5).to_csv(o, index=False)

            s = f"""
            .. csv-table:: {f}
               :file: _code/{o}
               :header-rows: 1
            """
            print(s.strip())
    except Exception as e:
        print(e)
    finally:
        try:
            spark.stop()
            print('closed spark')
        except Exception as e:
            print(e)

You can use numpy functions against continuous variables.

y ~ np.sin(x1) + np.cos(x2) + a + b

| Intercept | a[T.right] | b[T.low] | b[T.mid] | np.sin(x1) | np.cos(x2) |
|---|---|---|---|---|---|
| 1.0 | 0.0 | 1.0 | 0.0 | 0.8893769205406579 | -0.758004200582313 |
| 1.0 | 0.0 | 1.0 | 0.0 | 0.9679261582216445 | -0.5759807266894401 |
| 1.0 | 1.0 | 0.0 | 0.0 | 0.9972849995254774 | -0.9086185088676886 |
| 1.0 | 1.0 | 0.0 | 1.0 | -0.14934132364604816 | 0.4783416124776783 |
| 1.0 | 0.0 | 1.0 | 0.0 | 0.45523550315103734 | -0.7588816501987654 |
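As a sanity check, the transformed columns are just numpy functions applied elementwise to the raw x1 and x2 values; a minimal sketch reproducing the first row, using the first-row values from the dummy-data table:

```python
import numpy as np

# First-row raw values from the dummy data.
x1 = 19.945536387662504
x2 = 3.85214120038979

# The formula terms np.sin(x1) and np.cos(x2) are applied elementwise
# to each row of the numeric columns.
sin_x1 = float(np.sin(x1))  # matches the np.sin(x1) column
cos_x2 = float(np.cos(x2))  # matches the np.cos(x2) column
```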

The * operator specifies interactions and keeps lower-order terms.

y ~ x1*x2

| Intercept | x1 | x2 | x1:x2 |
|---|---|---|---|
| 1.0 | 19.945536387662504 | 3.85214120038979 | 76.83302248278848 |
| 1.0 | 20.674308066353493 | 4.098585619118175 | 84.73542172597531 |
| 1.0 | 20.346647025958433 | 2.7107604387194626 | 55.154885818557126 |
| 1.0 | 18.699653829045985 | 5.2111542692543065 | 97.44678088481062 |
| 1.0 | 21.51851187887476 | 2.432390426907621 | 52.341422295472896 |

y ~ a*b

| Intercept | a[T.right] | b[T.low] | b[T.mid] | a[T.right]:b[T.low] | a[T.right]:b[T.mid] |
|---|---|---|---|---|---|
| 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
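The categorical-by-categorical interaction columns are simply elementwise products of the main-effect dummy columns; a plain-Python sketch using three of the rows above:

```python
# Dummy columns for three example rows (rows 1, 3, and 4 of the a*b output).
a_right = [0.0, 1.0, 1.0]   # a[T.right]
b_mid   = [0.0, 0.0, 1.0]   # b[T.mid]

# The a[T.right]:b[T.mid] column is the rowwise product of the two dummies.
interaction = [x * y for x, y in zip(a_right, b_mid)]  # -> [0.0, 0.0, 1.0]
```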

The : operator specifies interactions and drops lower-order terms.

y ~ x1:x2

| Intercept | x1:x2 |
|---|---|
| 1.0 | 76.83302248278848 |
| 1.0 | 84.73542172597531 |
| 1.0 | 55.154885818557126 |
| 1.0 | 97.44678088481062 |
| 1.0 | 52.341422295472896 |
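Concretely, * and : differ only in whether the main effects are retained; for continuous variables, the interaction column itself is the same product either way. A sketch using the first-row values of x1 and x2:

```python
# First-row values of x1 and x2 from the dummy data.
x1 = 19.945536387662504
x2 = 3.85214120038979

# x1*x2 expands to x1 + x2 + x1:x2 (main effects kept);
# x1:x2 yields only the product column (main effects dropped).
star_columns = [1.0, x1, x2, x1 * x2]   # Intercept, x1, x2, x1:x2
colon_columns = [1.0, x1 * x2]          # Intercept, x1:x2
```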

y ~ a:b

| Intercept | b[T.low] | b[T.mid] | a[T.right]:b[high] | a[T.right]:b[low] | a[T.right]:b[mid] |
|---|---|---|---|---|---|
| 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |

The / operator is quirky according to the patsy documentation; it is shorthand for nesting: a / b = a + a:b.

y ~ (x1 + x2) / (a + b)

| Intercept | x1 | x2 | x1:x2:a[left] | x1:x2:a[right] | x1:x2:b[T.low] | x1:x2:b[T.mid] |
|---|---|---|---|---|---|---|
| 1.0 | 19.945536387662504 | 3.85214120038979 | 76.83302248278848 | 0.0 | 76.83302248278848 | 0.0 |
| 1.0 | 20.674308066353493 | 4.098585619118175 | 84.73542172597531 | 0.0 | 84.73542172597531 | 0.0 |
| 1.0 | 20.346647025958433 | 2.7107604387194626 | 0.0 | 55.154885818557126 | 0.0 | 0.0 |
| 1.0 | 18.699653829045985 | 5.2111542692543065 | 0.0 | 97.44678088481062 | 0.0 | 97.44678088481062 |
| 1.0 | 21.51851187887476 | 2.432390426907621 | 52.341422295472896 | 0.0 | 52.341422295472896 | 0.0 |
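Under the a / b = a + a:b expansion, (x1 + x2) / (a + b) becomes x1 + x2 + x1:x2:a + x1:x2:b, and each nested column is the product of the continuous values and the relevant dummy. A sketch using the fourth dummy-data row, where a is 'right':

```python
# Fourth-row values from the dummy data (a = 'right' for this row).
x1 = 18.699653829045985
x2 = 5.2111542692543065

# Nested term x1:x2:a[right]: product of both continuous values and the
# a[right] dummy (1.0 here); the x1:x2:a[left] column is zeroed out.
x1_x2_a_right = x1 * x2 * 1.0
x1_x2_a_left = x1 * x2 * 0.0
```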

If you need to drop the intercept, add - 1 at the end of the formula. Note that neither dummy variable for a is dropped: with the intercept removed, patsy switches the categorical variable to full-rank (one-hot) coding so that the model matrix keeps the same rank.

y ~ x1 + x2 + a - 1

| a[left] | a[right] | x1 | x2 |
|---|---|---|---|
| 1.0 | 0.0 | 19.945536387662504 | 3.85214120038979 |
| 1.0 | 0.0 | 20.674308066353493 | 4.098585619118175 |
| 0.0 | 1.0 | 20.346647025958433 | 2.7107604387194626 |
| 0.0 | 1.0 | 18.699653829045985 | 5.2111542692543065 |
| 1.0 | 0.0 | 21.51851187887476 | 2.432390426907621 |
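A plain-Python sketch of the full-rank (one-hot) coding that a receives once the intercept is dropped (illustration only, not ydot's internal code):

```python
levels = ['left', 'right']

def one_hot(value):
    # One column per level: a[left], a[right] (no reference level dropped).
    return [1.0 if value == level else 0.0 for level in levels]

row1 = one_hot('left')    # -> [1.0, 0.0]
row3 = one_hot('right')   # -> [0.0, 1.0]
```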

Bibliography

[pat]

patsy: describing statistical models in Python. URL: https://patsy.readthedocs.io/en/latest/index.html.

PySpark Formula

Formula

The formula module contains code to extract values from a record (e.g. a row of a Spark dataframe) based on the model definition.

class ydot.formula.CatExtractor(record, term)

Bases: ydot.formula.Extractor

Categorical extractor (no levels).

__init__(record, term)

ctor.

Parameters
  • record – Dictionary.

  • term – Model term.

Returns

None.

property value

Gets the extracted value.

class ydot.formula.ConExtractor(record, term)

Bases: ydot.formula.Extractor

Continuous extractor (no functions).

__init__(record, term)

ctor.

Parameters
  • record – Dictionary.

  • term – Model term.

Returns

None.

property value

Gets the extracted value.

class ydot.formula.Extractor(record, term, term_type)

Bases: abc.ABC

Extractor to get value based on model term.

__init__(record, term, term_type)

ctor.

Parameters
  • record – Dictionary.

  • term – Model term.

  • term_type – Type of term.

Returns

None.

abstract property value

Gets the extracted value.

class ydot.formula.FunExtractor(record, term)

Bases: ydot.formula.Extractor

Continuous extractor (with functions defined).

__init__(record, term)

ctor.

Parameters
  • record – Dictionary.

  • term – Model term.

Returns

None.

property value

Gets the extracted value.

class ydot.formula.IntExtractor(record, term)

Bases: ydot.formula.Extractor

Intercept extractor. Always returns 1.0.

__init__(record, term)

ctor.

Parameters
  • record – Dictionary.

  • term – Model term.

Returns

None.

property value

Gets the extracted value.

class ydot.formula.InteractionExtractor(record, terms)

Bases: object

Interaction extractor for interaction effects.

__init__(record, terms)

ctor.

Parameters
  • record – Dictionary.

  • terms – Model term (possibly with interaction effects).

Returns

None.

property value

Gets the extracted value.

class ydot.formula.LvlExtractor(record, term)

Bases: ydot.formula.Extractor

Categorical extractor (with levels).

__init__(record, term)

ctor.

Parameters
  • record – Dictionary.

  • term – Model term.

Returns

None.

property value

Gets the extracted value.

class ydot.formula.TermEnum(value)

Bases: enum.IntEnum

Term types.

  • CAT: categorical without levels specified

  • LVL: categorical with levels specified

  • CON: continuous

  • FUN: continuous with function transformations

  • INT: intercept

CAT = 1
CON = 3
FUN = 4
INT = 5
LVL = 2
static get_extractor(record, term)

Gets the associated extractor based on the specified term.

Parameters
  • record – Dictionary.

  • term – Model term.

Returns

Extractor.

Spark

The spark module contains code to transform a Spark dataframe into design matrices as specified by a formula.

ydot.spark.get_columns(formula, sdf, profile=None)

Gets the expanded columns of the specified Spark dataframe using the specified formula.

Parameters
  • formula – Formula (R-like, based on patsy).

  • sdf – Spark dataframe.

  • profile – Profile. Default is None and profile will be determined empirically.

Returns

Tuple of columns for y, X.

ydot.spark.get_profile(sdf)

Gets the field profiles of the specified Spark dataframe.

Parameters

sdf – Spark dataframe.

Returns

Dictionary.

ydot.spark.smatrices(formula, sdf, profile=None)

Gets tuple of design/model matrices.

Parameters
  • formula – Formula.

  • sdf – Spark dataframe.

  • profile – Dictionary of data profile.

Returns

y, X Spark dataframes.

About

One-Off Coder is an educational, service and product company. Please visit us online to discover how we may help you achieve life-long success in your personal coding career or with your company’s business goals and objectives.

Citation

@misc{oneoffcoder_ydot_2020,
  title={ydot, R-like formulas for Spark Dataframes},
  url={https://github.com/oneoffcoder/pyspark-formula},
  author={Jee Vang},
  year={2020},
  month={Dec}}

Author

Jee Vang, Ph.D.

  • Patreon: support is appreciated

  • GitHub: sponsorship will help us change the world for the better

Help