Bayesian linear regression with STAN (R)
Aug 2021 by Francisco Juretig
The objective of classical linear regression is to find the coefficients (B1,...,Bk) of the following equation
Y = B1 * x1 + B2 * x2 + Bk * xk that minimise the square distance between the predictions of this model
and the data. This is usually obtained by using ordinary least squares (OLS), which can easily be done via the
A nice introduction to bayesian regression can be found here.
In Bayesian regression, we assume that the B1...Bk coefficients are not fixed values that we need to recover, but random variables themselves with some distribution. We will feed a prior distribution for each one (you can think of priors
as initial approximations/thoughts that we have) that we intend to update with the data. In consequence, we can think of this as a process where we put prior distributions in, they get mixed with the data, and we estimate the posterior distributions.
Let's put this in a simple example. For example, let's work with a pricing example. Let's assume that we want to estimate how much the sales change when we alter the price of a product.
Our model is simply:
sales = B1 * price
We can have a prior idea, let's assume that we think that typically for each $ that the price increases, we would sell 2 units less. We also would expect that the impact is at most 0 (meaning that the range would be -inf,0).
The reason for that is that we would expect the impact to be capped at 0: a price increase cannot increase the sales.
We may not be very sure about the exact shape of this distribution, but what I said should be sufficient to build a preliminary one. We call this a prior.
We then build a mathematical procedure to alter this prior based on the data. If the data tells us that for each $ that we spend, the sales go down by 10, most of the prior effect will be dilluted. If on the contrary,
we find out that for every $ that we spend, the sales go down by 2, then the prior effect will be reinforced.
But how does the math work? We need to use a procedure to estimate the distribution of these parameters taking the prior distributions into consideration. One of the most well known procedures to do this is the so called
Markov Chain Monte Carlo (MCMC). Out of the many Bayesian libraries out there, STAN is regarded as the best one. The idea will be consistent with what I described so far:
load the data and the priors, and obtain posterior distributions. The only really hard part will be analysing the outputs to make sure that the posteriors distributions are reliable. Basically these MCMC procedures generate a sequence
of random numbers for each parameter that behave as if the underlying distribution was the posterior density (this is important: we don't recover the shape of the distribution, we recover random numbers. Of course we can estimate a density out of them, but that's a different story).
Here we will use prython as our IDE since it allows us to separate our code in a nice way. It is a new R and Python IDE that allows you to put your code in panels that you can connect to each other. Inside each panel you can use any R or python code you would normally use
Here we load the data. The data_to_model has the raw data plus some transformations: in particular, we transformed the website column into three 0/1 dummies. (standard, red,blue) Our X dataframe has those three dummies, plus an extra for a coupon variable (this is also 0 or 1). The objective is to estimate the "Total" variable in terms of those 4 variables that are on the X dataframe.
This is where the magic happens: we first put the model into a string, and then we call stan. In data we just declare the dimensions of our dataset, in model we specify the model and the priors (note that
here we specify a linear one). We are using 3 loose priors centered at 100. Usually Gaussian ones (normal ones) are used if we suspect the prior to be symmetric.
When we define
stan_data we create a list of the data that we are passing. K is the number of variables, and N is the number of rows.
print() to print the results.
traceplot function will print the MCMC chains in different colours, for each parameter. What we want to see here is random noise, with no obvious structure (exactly what we see here).
For a much more detailed discussion about diagnostics see here.
Finally, we call
mcmc_interval to print the intervals for the means. Because we are generating posterior distributions, we need some metric to summarise them. Some people here use the mode, or the
Note that here, we didn't specify boundaries for the priors, and we assume they take the full -inf,inf range. We could potentially add them, for example a distribution bounded between 0 and 1, or 0 and +inf.
How do we run it? Each panel has three running modes. You can see the blue carets next to the python button (this button is used to switch between R and python). The first running mode runs only one panel, the second one runs up to that panel (meaning everything that is connected to IN) and the third one runs everything that is connected to OUT (meaning everything that uses the objects defined in here).
One extra thing that we can do, is to plot the posterior densities. As we can see, they look similar to the normal ones we loaded as priors, but got displaced to different places. One conclusion is that beta which corresponds to the "blue" effect, has most likely a greater impact than the other two.
How do we interpret the results? The posterior distributions (same as the priors) can be interpreted as a subjective probabilistic assessment on where we think a parameter lies (what values are possible, and with what associated probabilities).
Prefer a video?
Here we use RStan to do linear bayesian regression
You will also find the code, project and files that you can download from a github repo.https://github.com/fjuretig/amazing_data_science_projects.git