Efficient feature selection with abess (R)

Aug 2021 by Francisco Juretig

Doing linear regression in R is extremely easy: it is done via the lm() function. However, when we intend to use these models for prediction, their performance may not be superb. The main reason is that every variable used in the model will be kept, which means that noisy variables with only a small correlation with the target will appear anyway. The lm() function handles this, in a way, by estimating a large variance for those coefficients: an irrelevant variable may get a coefficient of 250 but also a standard error of 500. Although that is fine for the purposes of doing inference (analysis) on the model, it is not great for prediction. A much better approach may be to remove that variable altogether.

The LASSO allows us to do this by penalizing all coefficients according to how big they are. It can be proven that the irrelevant features get pushed to exactly zero, so in some sense LASSO does the variable selection for us. The typical approach is to use cross-validation (or a test dataset) to determine how strong the LASSO penalty should be. Should it push a lot, or just slightly? It is impossible to know a priori, which is why we usually run several fits and keep the penalty strength that gives the best prediction performance.
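For reference, this cross-validated LASSO workflow is what packages like glmnet implement; a minimal sketch (glmnet is my choice here for illustration, the post itself uses abess instead):

```r
# LASSO with cross-validated penalty strength via glmnet
library(glmnet)

set.seed(1)
n <- 200; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- 3 * x[, 1] - 2 * x[, 2] + rnorm(n)  # only 2 truly relevant features

# cv.glmnet tries a grid of penalty strengths (lambda) and scores each
# one by cross-validated prediction error
cv_fit <- cv.glmnet(x, y, alpha = 1)  # alpha = 1 selects the LASSO penalty

# At the best lambda, most irrelevant coefficients are shrunk to exactly zero
coef(cv_fit, s = "lambda.min")
```

Every extra lambda value on the grid means refitting the model, which is why this gets expensive on large problems.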

As you can imagine, this is a computationally intensive approach, especially for large problems. The recently released abess package allows us to do it extremely efficiently.

Short video

Just a quick overview; you can see the specific images below.

We first use the generate.data() function to generate some dummy data. We generate 1000 observations, with 50 features, and only 3 relevant features.

That implies that only three coefficients should be non-zero in a perfect ML model. Bear in mind that this function is intended to generate synthetic data just for testing this package, and is not really meant to be used outside it. We use it here because it lets us choose ourselves how many relevant features there are.
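The data-generation step is a one-liner; a minimal sketch, assuming abess is installed from CRAN (the $x and $beta components are what generate.data() is documented to return, but worth checking against your installed version):

```r
# Generate synthetic data: 1000 observations, 50 features, 3 relevant ones
library(abess)

set.seed(1)
dat <- generate.data(n = 1000, p = 50, support.size = 3)

dim(dat$x)          # 1000 x 50 feature matrix
sum(dat$beta != 0)  # number of truly relevant coefficients: 3
```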

On the left we can see the panel where we fit the abess function. We print the dev (deviance) and GIC for each number of features. Typically, we want to choose the number of features where the GIC is minimised (in this case, 3).
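That fit could look like the sketch below; tune.type = "gic" is the tuning option I am assuming from the abess documentation:

```r
library(abess)

set.seed(1)
dat <- generate.data(n = 1000, p = 50, support.size = 3)

# Fit best-subset models over a path of support sizes, tuned on GIC
abess_fit <- abess(dat$x, dat$y, tune.type = "gic")

# Printing the fit shows support.size, dev and GIC per row;
# GIC should bottom out at support.size = 3
print(abess_fit)
```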

Here we can see the three non-zero coefficients.
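Extracting those coefficients could look like this; extract() is the abess helper that, as I understand its documentation, returns the model at the tuned support size:

```r
library(abess)

set.seed(1)
dat <- generate.data(n = 1000, p = 50, support.size = 3)
abess_fit <- abess(dat$x, dat$y, tune.type = "gic")

best <- extract(abess_fit)  # model at the GIC-optimal support size
best$support.size           # how many features were kept
best$beta                   # coefficient vector; only the kept entries are non-zero
```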

In the top panel (center) we added a restriction: we are forcing the function to keep specifically the first feature. You can see it got assigned a non-zero coefficient (though a very small one). We could force as many features as we want here. Not surprisingly, 4 features now minimise the GIC (because we are forcing it to keep an irrelevant feature in addition to the truly relevant ones).
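Forcing a feature in is done with the always.include argument (the parameter name as given in the abess documentation; a sketch):

```r
library(abess)

set.seed(1)
dat <- generate.data(n = 1000, p = 50, support.size = 3)

# Column index 1 must appear in every candidate model
fit_forced <- abess(dat$x, dat$y, tune.type = "gic", always.include = 1)

# The selected model should now contain feature 1 plus the truly relevant ones
extract(fit_forced)$support.size
```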

We also added two extra panels on the right. Here we specify the support sizes that we want to test (3 and 4); in the panel below, we can see the resulting coefficients. Restricting the search like this obviously makes a huge difference in running time when we have a big dataset with hundreds of features.
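Restricting the candidate sizes is done via the support.size argument; a sketch under the same assumed API:

```r
library(abess)

set.seed(1)
dat <- generate.data(n = 1000, p = 50, support.size = 3)

# Only evaluate models with exactly 3 or 4 features
fit_sizes <- abess(dat$x, dat$y, tune.type = "gic", support.size = c(3, 4))

# Coefficients of the 3-feature model
coef(fit_sizes, support.size = 3)
```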