
Introducing Pytalite: Rally's ML model evaluation tool

By Lun Yu | December 9, 2019


At Rally, we use machine learning (ML) to power health activity and program recommendations. We use Python and PySpark on Apache Spark to train, test, and deploy our ML models. Like many other data scientists, one of the problems we face when developing ML models is how to interpret, evaluate, and diagnose them. To achieve that goal, we would like to:

  1. Understand how "accurate" the model's predictions are. For a classification model, this is not only the misclassification rate, but also how well the model distinguishes between the "positive" and "negative" classes. More importantly, we would like to understand how accuracy translates into business-related goals, for instance ROI.
  2. Understand how predictions are made, i.e. what role each input variable plays in the prediction. For simple regression (linear or logistic regression), the coefficient of an input variable tells us the "impact" of that variable; similarly, a single decision tree gives us the splits at each node. In those cases, "feature importance" is explicit and straightforward. For more complicated models, such as tree ensembles and deep learning, it is not. Scikit-learn implements an impurity-based feature importance score, which is now widely used for ensemble tree models. However, we would like a model-agnostic way to compute "feature importance" so we can apply it beyond tree ensembles.

These needs motivated us to develop a Python/PySpark package called "Pytalite" that provides model-agnostic tools for model evaluation and diagnostics, especially for PySpark, where functions such as partial dependence plots are not yet available. The current version of Pytalite supports the following plots:

  • Discrete Precision/Recall Plot
  • Feature Correlation Plot
  • Probability Density Plot
  • Feature Importance Plot
  • Accumulated Local Effect Plot
  • Partial Dependence Plot

In this blog, we are going to show you how to use Pytalite for model evaluation and diagnostics. Pytalite for Python is developed under Python 3.7 but is compatible with Python 2.7 as well. Pytalite for PySpark supports Spark 2.0 and above. It requires matplotlib>=2.2.x (1.4.3 is also supported, but the latest version is recommended), numpy>=1.9.x, scipy>=0.15.x, and multiprocess>=0.70.4. We will use the Titanic dataset, which you can download from Kaggle, for demonstration. To use Pytalite, you need a dataset with your input variables (X) and outcome variable (y), and a model object (i.e. the trained model you would like to evaluate; this could be a pickled object or a PySpark model object). In the example below, we trained a simple multi-layer perceptron model to predict who survived the Titanic shipwreck.
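As a rough sketch of that setup (this is not Pytalite code; the tiny inline dataframe below is a stand-in for the Kaggle Titanic data, and the column names are illustrative):

```python
import pandas as pd
from sklearn.neural_network import MLPClassifier

# Stand-in for the Kaggle Titanic training data (download train.csv for the real thing)
df = pd.DataFrame({
    "Sex_female": [1, 0, 1, 0, 1, 0, 1, 0],
    "Fare":       [71.3, 7.3, 53.1, 8.1, 30.0, 8.5, 26.6, 13.0],
    "Survived":   [1, 0, 1, 0, 1, 0, 1, 0],
})

X = df[["Sex_female", "Fare"]].values
y = df["Survived"].values

# A simple multi-layer perceptron, as in our example
model = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
model.fit(X, y)

# Predicted survival probability, which the plots below operate on
proba = model.predict_proba(X)[:, 1]
```

With `df` and `model` in hand, you are ready to call the Pytalite plotting functions.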

Discrete Precision/Recall Plot

Here, we pass the dataframe df, the outcome label 'Survived', the model object model, and the number of deciles num_deciles. This function plots the probability deciles from a binary classification model against the cumulative precision and recall. Why do we need this plot? Usually we use a binary classification model to predict the probability of the "positive" class, and often we rely on this probability to decide who to target; for example, in a marketing campaign we would like to target the customers who are most likely to purchase the product. When we use a model for targeting, what we really care about is how many people should be targeted and what the ROI of that targeting is. The discrete precision/recall plot helps us answer those questions.

Precision is simply the fraction of the people we target who actually belong to the "positive" class. As we target people from high probability to low probability, we expect the cumulative precision to drop. For the top X% of people we target, we know the cost as a function of X%: Cost(X%). Based on the precision P%, we know the return is Return(X% * P%). With cost and return, we can simply calculate ROI for business purposes!

Recall, on the other hand, is simply the fraction of all actual "positive" cases that we capture. As we target more people, we expect the recall to increase. Suppose the total opportunity is OPP. The cumulative recall then tells us, when we target the top X% of people ranked by predicted probability, how much of that total opportunity we capture. We can then also calculate the opportunity cost for business purposes!
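To make the idea concrete, here is a minimal numpy sketch of what a discrete precision/recall plot displays (this mirrors the computation, not Pytalite's implementation): rank by predicted probability, then compute cumulative precision and recall at each decile cutoff.

```python
import numpy as np

def cumulative_precision_recall(y_true, scores, num_deciles=10):
    """Cumulative precision/recall after targeting the top-k deciles by score."""
    order = np.argsort(scores)[::-1]            # highest predicted probability first
    y = np.asarray(y_true)[order]
    total_pos = y.sum()
    # Cutoff index after each decile of the population
    cuts = [int(round(len(y) * k / num_deciles)) for k in range(1, num_deciles + 1)]
    precision = [y[:c].sum() / c for c in cuts]          # positives among targeted
    recall = [y[:c].sum() / total_pos for c in cuts]     # positives captured overall
    return precision, recall

# Toy example: 3 positives, scores roughly aligned with the labels
p, r = cumulative_precision_recall(
    [1, 1, 0, 1, 0, 0, 0, 0, 0, 0],
    [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1],
)
```

As expected, precision starts high and decays as lower-scored people are targeted, while recall climbs to 1.0 once everyone is targeted.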

Probability Density Plot

Here we pass in the model object as a list, which means you can pass more than one model object at the same time. To distinguish multiple models, you need to pass in model names as well. This plot draws the distribution of predicted probability grouped by the true labels. A good classification model has the ability to distinguish the "positive" and "negative" classes, which means the distributions of predicted probability for each class should be well separated. If we see overlap, the model struggles to distinguish the two classes. For example, in the above plot, we observe some records predicted with high probability that actually belong to the "negative" class. This plot helps us get an idea of how the model performs on the different classes. Along with the plot, we provide several metrics to quantify the "separation" of the two distributions.

Overall, the density plot visualizes model performance for further investigation. We suggest combining it with overall model performance metrics such as ROC-AUC to get a better understanding of model performance.
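One illustrative separation metric (an assumption on our part, not necessarily one of the metrics Pytalite reports) is the two-sample Kolmogorov-Smirnov statistic, which measures the maximum gap between the score distributions of the two classes:

```python
import numpy as np
from scipy.stats import ks_2samp

def class_separation(y_true, scores):
    """KS statistic between the score distributions of the two classes.

    0 means total overlap; 1 means perfect separation.
    """
    y = np.asarray(y_true)
    s = np.asarray(scores)
    stat, _ = ks_2samp(s[y == 1], s[y == 0])
    return stat

# Perfectly separated scores -> KS = 1.0
sep = class_separation([0, 0, 0, 1, 1, 1], [0.1, 0.2, 0.15, 0.9, 0.8, 0.85])

# Identical scores for both classes -> KS = 0.0
overlap = class_separation([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5])
```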

Feature Importance Plot

As we mentioned before, we would like a model-agnostic way to calculate feature importance. What we implemented here is from Fisher, Rudin, and Dominici (2018), also known as permutation feature importance. For feature i, we first calculate the loss using the unpermuted dataframe; then we perturb the values of feature i, for example by shuffling them, and calculate the loss using the permuted dataframe. The feature importance is defined as

Loss_permuted - Loss_unpermuted

In Pytalite, we permute the values of each feature multiple times using Python multiprocessing, and calculate the cross-entropy loss (i.e. log-loss) difference. In this way, we can also estimate the error of the feature importance values. The plot can rank the n_top features. To interpret this feature importance score: it tells us how much impact "messing up" the values of feature i has on the model predictions. Under this definition, the important features are those that increase the model's prediction loss when we permute their values. Note that this feature importance is relative among the input variables, and it is a global property. So far, Pytalite does not support estimating local variable contributions to predictions, such as Shapley values.
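The procedure above can be sketched in a few lines (a simplified single-process illustration, not Pytalite's implementation, which uses multiprocessing and error estimates):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Log-loss increase when each feature's values are shuffled."""
    rng = np.random.default_rng(seed)
    base = log_loss(y, model.predict_proba(X)[:, 1])   # Loss_unpermuted
    importances = []
    for i in range(X.shape[1]):
        losses = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, i])                      # permute feature i in place
            losses.append(log_loss(y, model.predict_proba(Xp)[:, 1]))
        importances.append(np.mean(losses) - base)     # Loss_permuted - Loss_unpermuted
    return np.array(importances)

# Synthetic check: with shuffle=False the first two features are informative,
# the last two are noise, so the sketch should rank them accordingly.
X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
clf = LogisticRegression().fit(X, y)
imp = permutation_importance(clf, X, y)
```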

Correlation Plot

Besides the simple correlation between an input variable and the outcome, we also include the distribution of predicted probability for the input variable. This helps us understand how the predicted scores vary with a specific feature. Above, we plot the binary variable Sex_female against the outcome Survived; we can see that being female is associated with a higher survival rate (without controlling for other variables), and that the model also predicts higher probabilities for females. We extend this functionality to continuous variables as well: to produce the segmented chart, we apply an auto-binning algorithm to bin the continuous variable. The figure below shows the result for the variable Fare.
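A minimal sketch of that binning step, using pandas quantile binning as a stand-in for Pytalite's auto-binning algorithm:

```python
import numpy as np
import pandas as pd

def binned_outcome_rate(df, feature, outcome, n_bins=5):
    """Quantile-bin a continuous feature and compute the outcome rate per bin."""
    bins = pd.qcut(df[feature], q=n_bins, duplicates="drop")
    return df.groupby(bins, observed=True)[outcome].mean()

# Toy data where survival only occurs above a fare threshold
fare = np.arange(100, dtype=float)
df = pd.DataFrame({"Fare": fare, "Survived": (fare > 60).astype(int)})
rates = binned_outcome_rate(df, "Fare", "Survived", n_bins=5)
```

Plotting `rates` against the bin edges gives the segmented survival-rate chart for a continuous variable like Fare.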

Partial Dependence Plot and Accumulated Local Effect Plot

Feature importance is a global property; we would also like to understand the marginal effect of a single feature on the prediction outcome. For this purpose we developed functions to produce the partial dependence plot (PDP) and the accumulated local effect plot (ALE). Simply put, PDP and ALE both describe how a feature affects the prediction on average. The difference is that PDP averages over the marginal distribution, while ALE averages over the conditional distribution. Thus, if variables are correlated with each other, we may want to use ALE rather than PDP. We suggest plotting both to see whether the effect is consistent. Another note: we found that ALE does not make much sense for categorical variables, so Pytalite only implements ALE for continuous variables. In the following examples, we can see that the ALE and PDP plots for Fare are very similar.
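A one-dimensional PDP can be sketched in a few lines (a simplified illustration, not Pytalite's implementation): for each grid value v, set the feature to v for every row, score the model, and average the predictions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def partial_dependence_1d(model, X, feature_idx, grid):
    """Average predicted probability with feature_idx forced to each grid value."""
    values = []
    for v in grid:
        Xg = X.copy()
        Xg[:, feature_idx] = v           # replace the feature everywhere
        values.append(model.predict_proba(Xg)[:, 1].mean())
    return np.array(values)

# Synthetic check: the outcome increases with feature 0, so its PDP should rise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=300) > 0).astype(int)
clf = LogisticRegression().fit(X, y)
pdp = partial_dependence_1d(clf, X, 0, np.linspace(-2, 2, 5))
```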

partial dependence plot

accumulated local effect plot

The Pytalite project can be found in this Git repo, along with usage examples for Python and PySpark.

