Saturday, October 7, 2017

Simple Bayesian modeling by Stan

Overview

About Bayesian modeling, we can use some languages and tools. BUGS, PyMC, Stan. On this article, I made simple regression model by using Stan from Python.




Data

To try simple regression, I used the data set, Speed and Stopping Distances of Cars. R has the data set as default. So to get the data, type the following line on R console.

write.csv(cars, "cars.csv", row.names=FALSE)

On Python, we can read the data.

import pandas as pd

cars = pd.read_csv('cars.csv')
print(cars)
    speed  dist
0       4     2
1       4    10
2       7     4
3       7    22
4       8    16
5       9    10
6      10    18
7      10    26
8      10    34
9      11    17
10     11    28
11     12    14
12     12    20
13     12    24
14     12    28
15     13    26
16     13    34
17     13    34
18     13    46
19     14    26
20     14    36
21     14    60
22     14    80
23     15    20
24     15    26
25     15    54
26     16    32
27     16    40
28     17    32
29     17    40
30     17    50
31     18    42
32     18    56
33     18    76
34     18    84
35     19    36
36     19    46
37     19    68
38     20    32
39     20    48
40     20    52
41     20    56
42     20    64
43     22    66
44     23    54
45     24    70
46     24    92
47     24    93
48     24   120
49     25    85

As you can see, this data has just two columns, speed and dist. I used “speed” as explaining variable and “dist” as explained variable.

Stan modeling


When we use Stan, we need to write the model and save it to stan file. I wrote the following model and saved it to “cars.stan”.

data{
    int N;
    real X[N];
    real Y[N];
}

parameters{
    real a;
    real b;
    real<lower=0> sigma;
}

model {
    for (i in 1:N){
        Y[i] ~ normal(a * X[i] + b, sigma);
    }
}

This has three parts, data, parameters and model. Simply, the roles of each part are as followings.
  • data: The data which is used on the model part
  • parameters: The target to estimate
  • model: The model

Intuitively, the data part is similar to placeholder on TensorFlow from the viewpoint of the code.

On this case, the model means that Y follows normal distribution whose mean is the linear combination, and standard deviation is sigma.

Execute


To calculate and get sampled points, we need to pass the data to the model on Python.(Actually, you can also use R.)

You can install pystan by pip.

pip install pystan

It is easy to pass the data to the model.

import pystan
data = {'X':cars['speed'], 'Y':cars['dist'], 'N':cars.shape[0]}
fit = pystan.stan(file='cars.stan', data=data, seed=42)

The code above executes the calculation.
The variable, fit, has summary of the sampled points.

print(fit)
Inference for Stan model: anon_model_a754b000381b0bd871cba7120173c0e4.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

        mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
a       3.94    0.01   0.43    3.1   3.65   3.93   4.22   4.81   1437    1.0
b     -17.61    0.19   7.06 -31.57 -22.19 -17.54 -13.01   -4.0   1443    1.0
sigma  15.83    0.04   1.68  12.98  14.66  15.68  16.83  19.55   2174    1.0
lp__  -159.4    0.04    1.3 -162.7 -160.0 -159.1 -158.5 -158.0   1178    1.0

Samples were drawn using NUTS at Sat Oct  7 08:22:10 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat=1).
When you want to get the plot of the sampled points. Simply, you can use the function, plot().

fit.plot()
enter image description here

By using the estimated values, let’s draw the linear regression line.

import matplotlib.pyplot as plt
import numpy as np

samples = fit.extract(permuted=True)
x = np.linspace(0, max(cars['speed']))
y = samples['a'].mean() * x + samples['b'].mean()

plt.figure()
plt.plot(cars['speed'], cars['dist'], "o")
plt.plot(x, y, "r-")
plt.show()
enter image description here

Related Posts


Bayesian multiple regression by Stan

On the article, Simple Bayesian modeling by Stan, I made a simple linear regression by Stan and PyStan. So, as an extension of it, I made multiple regression model on the same manner to show how to do Bayesian modeling roughly. On this article, I used air quality data set, which R has as a default.


Reference