Choosing models via pycaret (python)
Aug 2021 by Francisco Juretig
Almost every time we build an ML model, we need to choose four things: the features, the transformations applied to those features, the model itself, and the hyper-parameters for that model. Almost invariably, this involves using scikit-learn or a similar library and doing some manual work to compare models and determine which ones work well. pycaret is specifically designed to solve this automatically. More specifically, pycaret can create a table comparing multiple models, sorted by whatever metric we want. Of course, these models use the best hyper-parameters found automatically, and we can ask for all sorts of feature transformations to be applied automatically (for example: thresholding, normalization, one-hot encoding, combining levels with few cases, etc.). On top of that, it also does automatic feature selection. In other words, we just pass a table containing the data and the target, and pycaret will do everything for us. It doesn't get any easier!
Here we will use prython as our IDE, since it lets us separate our code in a clean way. prython is a new R and Python IDE that organizes code into panels that you can connect to each other; inside each panel you can write any R or Python code you would normally use.
In this example, we load two files that we import via pandas. We then merge them by Unique_person_id. Once we have joined them, we keep only a few features. The objective of this exercise is to predict the variable called IS_VIP_CLIENT, which has two levels; this is essentially a simple binary classification problem. Note that pandas dataframes are automatically loaded in prython by the panel that modified or created them. Each panel has an IN/OUT connection, meaning that any panel connected to an OUT will be able to interact with the variables/objects defined in that panel.
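The merge-and-select step above is plain pandas. A minimal sketch follows; since the original files are not shown, the two tables are built inline here, and every column name other than Unique_person_id and IS_VIP_CLIENT is a placeholder:

```python
import pandas as pd

# Toy stand-ins for the two files loaded via pandas; the feature names
# (Age, Region, Total_spent) are assumptions for illustration only.
clients = pd.DataFrame({
    "Unique_person_id": [1, 2, 3],
    "Age": [34, 51, 27],
    "Region": ["N", "S", "N"],
})
purchases = pd.DataFrame({
    "Unique_person_id": [1, 2, 3],
    "Total_spent": [120.0, 980.5, 40.0],
    "IS_VIP_CLIENT": [0, 1, 0],
})

# Merge the two tables on the shared key, as described above.
df = clients.merge(purchases, on="Unique_person_id")

# Keep only the features we care about, plus the binary target.
df = df[["Age", "Region", "Total_spent", "IS_VIP_CLIENT"]]
print(df.shape)  # (3, 4)
```

With the real files you would replace the inline tables with two `pd.read_csv(...)` calls; everything else stays the same.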
This panel contains the actual pycaret model. Note that we just call a setup function that does pretty much everything: we pass a list of the categorical features and a list of the numeric ones, and we are almost done. In fact, pycaret can infer types automatically, but this is usually prone to some errors. Note we also pass silent=True, in order to stop pycaret from requiring user input (interactive input is currently not allowed in prython 2.00). The compare_models and pull calls are needed to actually do the computation and create a dataframe with the results. We are sorting the results by AUC (area under the ROC curve), but we could use other metrics.
As can be seen, the results are good (though not spectacular), and the model is quite acceptable (an AUC of 0.5 is no better than random guessing, while 1.0 is perfect). Note that the best model here is a random forest, but only by a thin margin.
How do we run it? Each panel has three running modes; you can see them as the blue carets next to the python button (this button switches the panel between R and Python). The first mode runs only that panel; the second runs up to that panel (everything connected to its IN); and the third runs everything connected to its OUT (everything that uses the objects defined in it). In this case we can click on the second running mode in this panel, which runs the necessary branch (going upwards) and avoids recomputing the other pycaret model on the left. Of course, we could instead go to the panel at the top and click on the third running mode (run everything below), which would recompute both pycaret models.
Let's create another experiment: we will use the same pycaret code, but in a new panel we will pass feature_selection=True. This conceptually creates a different model: variables will be selected automatically, so not everything we loaded will end up in the final model.
The rationale is simple: some features don't really help and just add noise, which causes overfitting. As we can see, there is an improvement. The beauty of prython is that we can put the two panels next to each other and compare the results directly. Doing this is quite awkward in almost every other programming environment.
Making predictions with pycaret is incredibly simple: we just call predict_model(model, data=dataset), which returns a dataframe containing the predictions.
What if we want to interact with that result dataframe? Let's attach a console so we don't need to rerun everything; we can now run code line by line. For example, let's get the cases where Recall was smaller than 0.4 using result[result["Recall"] < 0.4]. That is printed in the console displayed here. It's much faster and avoids the re-computation, but note that not every python object gets loaded here (see the prython documentation).
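Since the comparison table pulled from pycaret is just a pandas dataframe, the console snippet above is ordinary boolean indexing. A mocked-up illustration (the model names and metric values here are invented, not the article's actual results):

```python
import pandas as pd

# Mocked-up version of the comparison table returned by pull().
result = pd.DataFrame({
    "Model": ["Random Forest", "Logistic Regression", "Decision Tree"],
    "AUC": [0.78, 0.74, 0.66],
    "Recall": [0.52, 0.38, 0.31],
})

# Boolean indexing: keep only the rows where Recall is below 0.4.
low_recall = result[result["Recall"] < 0.4]
print(low_recall["Model"].tolist())  # ['Logistic Regression', 'Decision Tree']
```

Any other pandas operation (sorting, plotting, exporting to CSV) works on the table in exactly the same way.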
Prefer a video?
Here we use pycaret to choose the best classification model
You will also find the code, project and files that you can download from a github repo: https://github.com/fjuretig/amazing_data_science_projects.git