Efficient feature selection with abess (R)
Aug 2021 by Francisco Juretig
Doing linear regression in R is extremely easy: it is done via the lm() function. However, when we intend to use these models for prediction, their performance may not be superb. The main reason is that every variable used in the model will be kept.
That means that noisy variables that have only a small correlation with the target will appear anyway. To some extent the lm() function handles this by estimating a large variance for those coefficients. For example, an irrelevant variable may get a coefficient of 250 but also a standard error of 500. Although that is fine for the purposes of doing inference (analysis) on the model, it's not that great for prediction. A much better approach may be to remove that variable altogether.
The LASSO approach lets us do this by penalizing coefficients according to how big they are. It can be proven that the coefficients of irrelevant features are pushed to exactly zero.
So, in a way, LASSO does the variable selection for us. The typical approach is to use cross-validation (or a held-out test dataset) to determine how strong the LASSO penalty should be. Should it push hard, or just slightly? It's impossible to know a priori, which is why we usually run several fits and pick the penalty strength that gives the best prediction performance.
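To make that baseline concrete, here is a minimal sketch of the usual cross-validated LASSO workflow using the glmnet package (glmnet is not part of this project; the simulated data here is also just an assumption for illustration):

```r
# Sketch of the standard cross-validated LASSO workflow with glmnet.
# cv.glmnet fits the model along a whole grid of penalty strengths
# (lambda) and cross-validates each one -- this is the expensive part.
library(glmnet)

set.seed(1)
n <- 1000; p <- 50
x <- matrix(rnorm(n * p), n, p)
# only the first 3 features are truly relevant in this toy example
y <- 3 * x[, 1] - 2 * x[, 2] + 1.5 * x[, 3] + rnorm(n)

cv_fit <- cv.glmnet(x, y, alpha = 1)      # alpha = 1 -> LASSO penalty
coefs  <- coef(cv_fit, s = "lambda.min")  # coefficients at the best lambda
print(sum(coefs != 0) - 1)                # number of selected features
```

The repeated refitting over the lambda grid is exactly the cost that abess avoids.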
As you can imagine, this is a computationally intensive approach, especially for large problems. The recently released abess package allows us to do this extremely efficiently.
Want this project? You can download it from here
Here we will use prython as our IDE since it allows us to separate our code in a nice way. It is a new R and Python IDE that lets you put your code in panels that you can connect to each other. Inside each panel you can use any R or Python code you would normally use. Just a quick overview: you can see the specific images below.
We first use the generate.data() function to generate some dummy data: 1000 observations, 50 features, and only 3 relevant features. That implies that only three features should get a non-zero coefficient in a perfect model. Bear in mind this function is intended to generate synthetic data just for testing this package, and is not really meant to be used outside it. It is handy here because we can choose ourselves how many features are truly relevant.
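The data-generation step can be sketched roughly like this (argument and field names follow the abess documentation as I recall them; check ?generate.data):

```r
# Generate synthetic data with abess: 1000 rows, 50 features,
# of which only 3 have a non-zero true coefficient.
library(abess)

set.seed(1)
dat <- generate.data(n = 1000, p = 50, support.size = 3)

print(dim(dat$x))          # 1000 x 50 design matrix
print(sum(dat$beta != 0))  # 3 -- the true non-zero coefficients
```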
How do we run it? Each panel has three running modes. You can see the blue carets next to the python button (this button is used to switch between R and Python). The first running mode runs only that panel; the second runs up to that panel (meaning everything that is connected to IN); and the third runs everything that is connected to OUT (meaning everything that uses the objects defined here).
On the left we can see the panel where we fit the abess function. We print the dev (deviance) and GIC for each number of features. Typically, we want to choose the number of features where the GIC is minimised (in this case 3).
Here we can see the three non-zero coefficients.
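Pulling those coefficients out can be done with extract(), which returns the model at the best (GIC-minimising) support size; the field names are assumptions from the abess docs:

```r
# Extract the best model and its non-zero coefficients.
library(abess)

set.seed(1)
dat <- generate.data(n = 1000, p = 50, support.size = 3)
abess_fit <- abess(dat$x, dat$y)

best <- extract(abess_fit)        # model at the GIC-minimising size
print(best$support.size)          # expected to be 3 here
print(best$beta[best$beta != 0])  # the three non-zero coefficients
```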
In the top panel (center) we added a restriction: we force the function to keep the first feature specifically. You can see it got assigned a non-zero coefficient (though a very small one). We could force as many features as we want here. Not surprisingly, 4 features minimise the GIC in this case (because we are forcing it to keep an irrelevant feature in addition to the truly relevant ones).
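The restriction is expressed through the always.include argument, which (per the abess docs, as I recall) takes the column indices that must stay in every candidate model:

```r
# Force feature 1 into every candidate model with always.include.
library(abess)

set.seed(1)
dat <- generate.data(n = 1000, p = 50, support.size = 3)

fit_forced <- abess(dat$x, dat$y, always.include = 1)

best <- extract(fit_forced)
print(best$support.size)        # the post reports 4 in its example
print(best$beta[1])             # the forced feature's (small) coefficient
```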
We also added two extra panels on the right. Here we specify the support sizes we want to test (3 and 4); in the panel below we can see the resulting coefficients. Restricting the search this way obviously makes a huge difference in runtime when we have a big dataset with hundreds of features.
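Restricting the search to specific support sizes is a one-argument change; a minimal sketch, assuming the support.size argument accepts a vector as in the abess docs:

```r
# Only evaluate support sizes 3 and 4 instead of the full path --
# much cheaper on large problems.
library(abess)

set.seed(1)
dat <- generate.data(n = 1000, p = 50, support.size = 3)

fit_small <- abess(dat$x, dat$y, support.size = c(3, 4))

# coefficients for each requested support size
print(coef(fit_small))
```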