Choosing models via pycaret (Python)

Aug 2021 by Francisco Juretig

Almost every time we build an ML model, we need to choose four things: the features, the transformations applied to those features, the model itself, and the hyper-parameters for that model. Almost invariably, this involves using scikit-learn or a similar library and doing some manual work to compare models and determine which ones work well. pycaret is specifically designed to solve this automatically. To be more specific, pycaret can create a table comparing multiple models, sorted by whatever metric we want. These models use the best hyper-parameters found automatically, and we can define all sorts of feature transformations to be applied automatically (for example: thresholding, normalization, one-hot encoding, combining levels with few cases, etc.). On top of that, it also does automatic feature selection. In other words, we just pass a table containing the data and the target, and pycaret does everything for us. It doesn't get any easier!

In this example, we load two files that we import via pandas. We then merge them by Unique_person_id. Once we have joined them, we keep only a few features. The objective of this exercise is to predict the variable called IS_VIP_CLIENT, which has two levels; this is essentially a simple classification problem. Note that pandas dataframes are automatically loaded in prython via the panel that modified or created them. Each panel has an IN/OUT connection, meaning that any panel connected to an OUT will be able to interact with the variables/objects defined in that panel.
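A minimal sketch of this step (the file names and the feature columns are assumptions, since the source does not name them; only Unique_person_id and IS_VIP_CLIENT come from the text):

```python
import pandas as pd

# Hypothetical file names for the two tables we load
clients = pd.read_csv("clients.csv")
purchases = pd.read_csv("purchases.csv")

# Merge the two tables on the shared key
df = clients.merge(purchases, on="Unique_person_id")

# Keep only a few features plus the target
# (the feature names here are illustrative)
df = df[["Age", "Income", "Region", "IS_VIP_CLIENT"]]
```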

This panel contains the actual pycaret model. Note that we just call a setup function that does pretty much everything: we pass a list of the categorical features and a list of the numeric ones, and we are almost done. In fact, pycaret can infer types automatically, but this is usually prone to some errors. Note that we also pass silent=True, to prevent pycaret from requiring user input (currently not allowed in prython 2.00). The compare_models and pull calls are needed to actually run the computation and create a dataframe with the results. We sort the results by AUC (area under the curve), but we could use other metrics. As can be seen, the results are good (though not spectacular), and the model is quite acceptable (for reference, an AUC of 0.5 is no better than random guessing).
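A sketch of what this panel contains, following the pycaret 2.x classification API (the feature lists are the same assumed column names as above):

```python
from pycaret.classification import setup, compare_models, pull

# Initialize the experiment; silent=True skips the interactive
# type-confirmation prompt (not supported in prython 2.00)
setup(data=df,
      target="IS_VIP_CLIENT",
      categorical_features=["Region"],      # assumed column names
      numeric_features=["Age", "Income"],
      silent=True)

# Train and cross-validate the available classifiers, ranked by AUC
best_model = compare_models(sort="AUC")

# Pull the comparison table into a pandas dataframe
result = pull()
```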


Note that the best model here is a random forest, though only by a thin margin.



Let's create another experiment: we will use the same pycaret code, but now in a new panel with feature_selection=True. Conceptually, this creates a different model: variables will be selected, so not everything we loaded will end up in the final model. The rationale is simple: some features don't really help and just add noise, which causes overfitting. As we can see, there is an improvement. The beauty of prython is that we can put the two panels next to each other and compare the results directly, something that is quite cumbersome in almost every other programming environment.
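The only change relative to the previous panel is the extra flag (again a sketch, with the same assumed column names):

```python
from pycaret.classification import setup, compare_models, pull

# Same experiment, but let pycaret drop uninformative features
setup(data=df,
      target="IS_VIP_CLIENT",
      categorical_features=["Region"],
      numeric_features=["Age", "Income"],
      feature_selection=True,   # automatic feature selection
      silent=True)

best_model = compare_models(sort="AUC")
result = pull()
```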

Making predictions with pycaret is incredibly simple: we just call predict_model(model, dataset), which creates a dataframe containing the predictions.
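For instance (new_data is a placeholder for whatever rows we want to score):

```python
from pycaret.classification import predict_model

# Score new rows with the best model found above; the returned
# dataframe contains the original columns plus prediction columns
predictions = predict_model(best_model, data=new_data)
```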

What if we want to interact with that result dataframe? Let's attach a console so we don't need to rerun everything. We can now run code line by line. For example, let's get the rows where the Recall was smaller than 0.4 using result[result["Recall"] < 0.4]; that is printed in the console displayed here. This is much faster and avoids the re-computations, but not every Python object gets loaded here (see the prython documentation).
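In the console, this is just ordinary pandas filtering on the comparison table pulled earlier:

```python
# Keep only the rows of the comparison table where Recall < 0.4
low_recall = result[result["Recall"] < 0.4]
print(low_recall)
```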

Prefer a video?

Here we use pycaret to choose the best classification model.

You will also find the code, project, and files that you can download from a GitHub repo:

https://github.com/fjuretig/amazing_data_science_projects.git