zoofs - variable selection (Python)
Aug 2021 by Francisco Juretig
Feature selection is an important part of every ML pipeline. There are many interesting libraries out there, but zoofs stands out due to the sophistication of its algorithms. It is meant to be used after we select the hyper-parameters for an ML model. Let's review how we do ML. For any ML regression or classification problem, we need to do four things:
1. choose the model type
2. tune the hyper-parameters
3. choose the features
4. use the model for prediction
zoofs is meant to help us with (3). In this example, we will work with a dummy dataset with 5 good features, 5 useless features (just noise), and a 0/1 target. The objective will be to evaluate how well the different algorithms implemented in zoofs work. We will first load the dataset via pandas, build the train/test sets, optimize the hyper-parameters for a Random Forest, and then feed this estimator into four different algorithms implemented in zoofs.
Want this project? You can download it from here (contains the prython project plus the csv input file)
Here we will use prython as our IDE, since it allows us to organize our code in a nice way. It is a new R and Python IDE that lets you put your code in panels that you can connect to each other. Inside each panel you can use any R or Python code you would normally use. This is just a quick overview; you can see the specific screenshots below.
Here we load our data from a csv file. The dataframe that we loaded is named "data". It has NUM1-NUM5 as relevant features, and rand1-rand5 as irrelevant ones (they are just noise, so a good variable selection algorithm should discard them). The target is IS_VIP_Client, which is a 1/0 variable (consequently, we will work with a classification model).
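The csv file itself ships with the downloadable project, so the sketch below generates a comparable dummy dataset inline instead (the sample size, coefficients, and the 70/30 split are assumptions); the column names match those in the article:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000

# 5 informative columns (NUM1-NUM5) and 5 pure-noise columns (rand1-rand5)
X_good = rng.normal(size=(n, 5))
X_noise = rng.normal(size=(n, 5))

# The target depends only on the NUM columns, so rand1-rand5 carry no signal
logits = X_good @ np.array([1.5, -2.0, 1.0, 0.8, -1.2])
target = (logits + rng.normal(scale=0.5, size=n) > 0).astype(int)

data = pd.DataFrame(
    np.hstack([X_good, X_noise]),
    columns=[f"NUM{i}" for i in range(1, 6)] + [f"rand{i}" for i in range(1, 6)],
)
data["IS_VIP_Client"] = target

X = data.drop(columns="IS_VIP_Client")
y = data["IS_VIP_Client"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
```

In the actual project, the first lines would simply be a `pd.read_csv` call on the provided csv file.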
Note that we connected this panel to a panel below that actually trains a Random Forest model. Here we find the best hyper-parameters using the well-known GridSearchCV function. This is because we need to first determine the hyper-parameters before we do variable selection.
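The exact parameter grid isn't visible in the screenshots, so the values below are illustrative assumptions; the GridSearchCV pattern itself looks like this (with stand-in data of the same shape as the article's dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in data: 10 features, of which 5 are informative
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=5, n_redundant=0, random_state=0)

# Illustrative grid; the real one would be tuned to the problem
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # same metric we will later ask zoofs to maximize
    cv=3,
)
search.fit(X, y)
best_model = search.best_estimator_  # this tuned estimator is what we feed to zoofs
```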
How do we run it? Each panel has three running modes. You can see the blue carets next to the python button (this button is used to switch between R and Python). The first running mode runs only that panel; the second one runs up to that panel (everything connected to its IN); and the third one runs everything connected to its OUT (everything that uses the objects defined in that panel).
After the hyper-parameter optimization, we need to define the objective function that we will optimize. After all, zoofs is, at its core, an optimization library: we need to build a function that returns something to be optimized. In our case, we will maximize the area under the ROC curve (AUC) for the test data.
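Concretely, zoofs calls the objective with the model plus the train/test splits restricted to the candidate feature subset. A minimal version of such a function (the name is arbitrary) could look like:

```python
from sklearn.metrics import roc_auc_score


def objective_function(model, X_train, y_train, X_valid, y_valid):
    """Fit the model on the candidate feature subset and return the
    test-set AUC; zoofs will try to maximize this value."""
    model.fit(X_train, y_train)
    preds = model.predict_proba(X_valid)[:, 1]
    return roc_auc_score(y_valid, preds)
```

Since higher AUC is better, the optimizers must be created with `minimize=False`.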
We will test 4 algorithms. GreyWolf, ParticleSwarm, a genetic algorithm, and the dragonfly algorithm. In order to run this, we just go to the panel at the top, and click on >> in the panel. That runs everything that is connected to OUT, which in our case is everything.
Here we can see two other algorithms
We can click on the 7th icon from the left on each panel to create a replica of the output (linked in real time). We can then move these replicas next to each other, as we did here, which makes it easy to compare multiple models. As we can see, ParticleSwarm got the best AUC, followed by DragonFly. Both the genetic algorithm and GreyWolf didn't perform well, as they ended up keeping all the variables.
Remember that a good algorithm here should keep the 5 relevant variables and discard the 5 irrelevant ones.