Robust regression (python)
Aug 2021 by Francisco Juretig
The objective of linear regression is to find the coefficients B1, B2, ..., Bk of the following problem:
Y = B1 * x1 + B2 * x2 + ... + Bk * xk + error.
It is, in essence, an optimization problem: we find the coefficients that minimise the sum of the squared residuals (the differences between the predictions and the observed values). This is called OLS (ordinary least squares).
Under quite general conditions these coefficients exist, are unique, and we can obtain confidence intervals for them. However, there is a well-known problem that can affect the quality of what we are doing: the presence of outliers. These abnormal values can cause our coefficients to change by a lot. This is not good, as it implies that our model really depends on a few cases that might very well be just exceptions.
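To make the idea concrete, here is a minimal sketch of OLS on synthetic data (the data and the true coefficients below are made up purely for illustration):

```python
import numpy as np

# Synthetic data: y = 1 + 3*x1 - 2*x2 + noise (coefficients chosen for illustration)
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])  # design matrix with intercept column
y = 1.0 + 3.0 * x1 - 2.0 * x2 + rng.normal(size=n)

# OLS = the coefficients minimising the sum of squared residuals;
# np.linalg.lstsq solves exactly that least-squares problem
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # close to [1, 3, -2]
```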
In order to do this exercise we will use the statsmodels library in python.
By the way, you can find a nice intro to robust regression here (in R).
In this case, we will run an example where we test several OLS models in python using the OLS() function from statsmodels. We add different levels of contamination to the data (we just grab a few cases and replace them with abnormal values) and test how well the OLS() and RLM() functions work. As we will see, even a few cases totally corrupt the OLS model, whereas the RLM function yields almost the same results. This RLM() function has essentially one main parameter, where we specify the kernel for it.
Here we will use prython as our IDE, since it allows us to separate our code in a nice way. It is a new R and Python IDE that lets you put your code in panels that you can connect to each other. Inside each panel you can use any R or Python code you would normally use.
We separated the project into two parts. Here on the right we have two panels. On the top one we define our dataset (the famous diabetes dataset from sklearn); note that we keep just one feature to make things simpler. On the lower panel we use the OLS model from statsmodels. Pretty simple, right? Pay attention to the x1 coefficient, which is estimated here at 949. Bear in mind there are no outliers here, nor any type of contamination.
How do we run it? Each panel has three running modes. You can see the blue carets next to the Python button (this button is used to switch between R and Python). The first running mode runs only one panel, the second one runs up to that panel (meaning everything that is connected to IN), and the third one runs everything that is connected to OUT (meaning everything that uses the objects defined in here).
On the left side, we have a top panel where we (again) load the diabetes dataset. We also add three outliers, giving them some very big values. Note that we have two panels connected here: on the left, a standard regression; on the right, a robust regression model. Let's see what happens in the next panels.
Let's look at the standard regression. As we can see, the x1 coefficient is now 787, which is radically different from what we had before the contamination. It took just three observations to completely corrupt the model.
Let's look at the robust regression. Here the x1 coefficient is estimated at 975, which is very close to the original OLS estimate we had. The outliers didn't cause the coefficient to move much.
Prefer a video?
Here we use RLM in Python via statsmodels.
You will also find the code, project, and files that you can download from a GitHub repo: https://github.com/fjuretig/amazing_data_science_projects.git