A Custom caret C5.0 Model for 2-Class Classification Problems with Class Imbalance
Robert Norberg
Monday, April 06, 2015
Introduction
In this post I share a custom model tuning procedure for optimizing the probability threshold for class imbalanced data. This is done within the excellent caret package framework and is akin to the example on the package website, but that example shows an extension of the random forest (or rf) method while I present an extension to the C5.0 method. The post is divided into two parts, Motivation and Code. If you’re savvy to the topic of class imbalanced data and you just want to see the code already, skip ahead to the Code section.

Motivation
Recently I was working on a 2-class classification problem where my data suffered from a fairly severe class imbalance (I had a lot more “non-events” than “events”). As far as I know, there are three ways to deal with this:
- Use an imbalanced cost matrix for your model.
- Over-sample “events”, under-sample “non-events”, or some combination of the two.
- Fit a classification model as you normally would, use it to generate the probability of an instance being in the “event” class, and then play with your cutoff point - that is, experiment with raising or lowering the probability of being an “event” at which you draw the line between “event” and “non-event”.
The first approach is a good one. Using an imbalanced cost matrix makes different errors “cost” different amounts. In my case, with “events” being rare, it makes sense that calling an event a non-event (missing a rare event) should be more expensive than calling a non-event an event (a false positive). Imposing costs this way actually causes your model parameters to adjust accordingly, whereas option 3 above does not. The major problem with this approach is that not all modelling techniques support non-uniform cost matrices. As far as I know, within the caret framework only three model types support cost-sensitive model fitting: C5.0, CART, and SVMs.
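For concreteness, here is a minimal sketch of the first approach using the C50 package directly, on data simulated with caret’s twoClassSim() (the simulated data and the cost values are purely illustrative, and the row/column orientation of the cost matrix should be double-checked against ?C5.0):

library(C50)
library(caret)  # only for twoClassSim(), used to simulate example data
set.seed(1)
d <- twoClassSim(n = 500, intercept = 7)  # Class1, the "event", is the rare class
# intended reading: misclassifying a true event is five times as costly as a false positive;
# verify the row/column convention for costs in ?C5.0 before relying on this
costMat <- matrix(c(0, 5, 1, 0), nrow = 2,
                  dimnames = list(levels(d$Class), levels(d$Class)))
costFit <- C5.0(x = d[, names(d) != "Class"], y = d$Class, costs = costMat)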
The second approach (over/under sampling) is perfectly fine if you have the data to do it. However, in my case, I had 15% events and 85% non-events. To balance the two classes by under-sampling the non-events I would have to keep only as many non-events as I had events, discarding about 70% of my data. I didn’t have enough data to feel confident doing this. On the other hand, to balance my data by over-sampling the events, I would have to include each observation that was an event 5 to 6 times. Having this many duplicate data points would surely bias my model. There are ways to randomly “fuzz” your over-sampled observations so you’re not just including a bunch of duplicate observations, but for reasons I won’t get into, I didn’t see this as a viable option with my data. Of course you could compromise and mix over-sampling and under-sampling, but I decided against this. There are actually really clever ways of doing this, and one should not write it off as a bad or sub-optimal strategy for dealing with class imbalanced data.
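For what it’s worth, caret ships helpers for both sampling directions; a minimal sketch on simulated data (again purely illustrative):

library(caret)
set.seed(1)
d <- twoClassSim(n = 500, intercept = 7)          # imbalanced two-class data
predictors <- d[, names(d) != "Class"]
down <- downSample(x = predictors, y = d$Class, yname = "Class")  # discard non-events
up   <- upSample(x = predictors, y = d$Class, yname = "Class")    # duplicate events
table(down$Class)
table(up$Class)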
The third approach is a commonly used one. Typically the strategy looks like this:
- Partition your data into a training/validation set and a test set (the common wisdom is 90% training/validation, 10% test).
- Estimate model hyperparameters (also called tuning parameters) using some sort of cross-validation with your training/validation data.
- With hyperparameters picked out, fit a model to the entire training/validation set (without leaving out a fold for cross-validation purposes).
- Generate class probabilities for the test set using the model from the previous step.
- Using the predicted class probabilities and known classes of the test data, find the cutoff point that yields the “best” class predictions, defining “best” however you like. Note that common measures of model performance such as area under the ROC curve are independent of the chosen cutoff point, so such measures are not appropriate here. A good measure I’ve seen is the euclidean distance between your model’s sensitivity/specificity and the ideal sensitivity/specificity of 1/1, that is, Performance = √((Sensitivity − 1)² + (Specificity − 1)²). (A small sketch of this step appears after the list.)
- Re-fit the model with the chosen hyperparameters, this time on the entire data set.
- Use this model to generate class probabilities for new observations and the chosen cutoff point to score the new data as an event or non-event.
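As promised, here is a small sketch of the cutoff-selection step (the fifth item above); testObs and testProbs are stand-ins generated only so the example runs:

# stand-in test-set classes and predicted P(event), just to make the sketch runnable
set.seed(1)
testObs   <- factor(rep(c("event", "nonevent"), times = c(30, 170)),
                    levels = c("event", "nonevent"))
testProbs <- ifelse(testObs == "event", rbeta(200, 4, 2), rbeta(200, 2, 4))
# evaluate a grid of cutoffs and keep the one closest to perfect sensitivity/specificity
cutoffs <- seq(0.01, 0.99, by = 0.01)
distToPerfect <- sapply(cutoffs, function(k) {
  pred <- factor(ifelse(testProbs >= k, "event", "nonevent"), levels = levels(testObs))
  sens <- mean(pred[testObs == "event"] == "event")
  spec <- mean(pred[testObs == "nonevent"] == "nonevent")
  sqrt((sens - 1)^2 + (spec - 1)^2)
})
bestCutoff <- cutoffs[which.min(distToPerfect)]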
There are two major drawbacks to this method. The first is that it “costs” you data when searching for optimal hyperparameters - you’re only using 90% of what’s available. This isn’t a big deal if you’ve got plenty of data. The second drawback is that the cutoff point you choose is biased towards a small portion of your data. You only used 10% of your data (with no cross-validation or anything!) to estimate the “best” cutoff point. There is the additional criticism that you trained your model on unbalanced data, which produces a model that is very good at identifying the prevalent class, but poor at identifying the rare class. A very good blog post I read recently suggests that while this criticism is not unfounded, a model fit to unbalanced data does not under-perform by very much. Choosing a cutoff probability other than 0.5 also corrects the “unbalanced” nature of your model.
The caret package provides an excellent framework for model fitting/validation, serving as an API of sorts to many different R packages with different modelling procedures. It also allows you to define your own modelling procedure if you wish. There are several examples of this on the caret package website, and one of these examples demonstrates adding the probability cutoff point discussed above to the list of a model’s hyperparameters. This is an interesting technique that solves the problem of data “cost” and bias in the selection of a cutoff probability other than the default 0.5. The resulting procedure follows this algorithm:

define the values of hyperparameters over which you would like to grid search
define the values of cutoff point over which you would like to search
for each combination of hyperparameters H_i {
  for each cross-validation fold F_j {
    fit a model M_{i,j} using hyperparameters H_i and all data except F_j
    use model M_{i,j} to predict class probabilities for hold out set F_j
    for each value of cutoff point C_k {
      generate class predictions for hold out set F_j from class probabilities using C_k as the threshold
      evaluate the performance of these predictions
    }
  }
}
Assuming you are using ten-fold cross validation repeated five times, following this algorithm will yield 50 estimates of model performance for every hyperparameter/threshold pair, without costing any extra data to estimate the threshold value. From these 50 performance estimates you can calculate a mean and standard error and choose the hyperparameter/threshold pair that yields the best model performance just as you would choose hyperparameters for any model. Not only does this not cost any extra data, but your choice of cutoff point is much less biased than had you estimated it using just one hold out set.
A reflexive argument against this is “but now I’ve massively increased the size of my grid search!” This is true: if you want to search over 10 values of hyperparameter A, 10 values of hyperparameter B, and 10 values of cutoff point, you now have 10×10×10=1000 parameter combinations to search over instead of the 10×10=100 combinations you would have if you only grid searched over the hyperparameters. However, you don’t need to fit a separate model to evaluate each cutoff value. You only need to fit a model for each combination of hyperparameters and cross-validation folds, then generate class probabilities (once) and score the data as “event” or “non-event” using each cutoff probability. So while the grid you must search over expands, the added time cost is just the time it takes to generate some class probabilities and execute a bunch of if-else logic. This is negligible in comparison to the time it takes to fit all those models.
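A rough, back-of-the-envelope way to convince yourself of this (timings will vary by machine; the simulated data are just for illustration):

library(C50)
library(caret)
set.seed(1)
d <- twoClassSim(n = 1000, intercept = 7)
X <- d[, names(d) != "Class"]
system.time(m <- C5.0(x = X, y = d$Class, trials = 40))  # fitting the model: the expensive part
p <- predict(m, X, type = "prob")[, "Class1"]             # class probabilities, computed once
system.time(
  preds <- lapply(seq(0.05, 0.95, by = 0.05),             # 19 cutoffs: essentially free
                  function(k) ifelse(p >= k, "Class1", "Class2"))
)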
Code
The example of creating a custom modelling procedure on the caret package website adds this threshold-optimizing functionality to the random forest method. I used this example to add the functionality to the C5.0 method. For kicks, I will also add the choice to use/not use fuzzy thresholding as a Boolean model hyperparameter.

To start, let’s grab the code behind the default C5.0 method and inspect it.

library(caret)
my_mod <- getModelInfo("C5.0", regex = FALSE)[[1]]
names(my_mod)
[1] "label" "library" "loop" "type" "parameters"
[6] "grid" "fit" "predict" "prob" "predictors"
[11] "varImp" "tags" "sort"
So there are 13 elements in this list that define exactly what train() does when you pass it the argument method = "C5.0". We will need to edit some of these, but not all of them. It’s definitely easier to start from the default C5.0 method code and make changes as needed than to start with an empty list and define each list element from scratch.

label
The element label just provides the name of the algorithm used. I suppose we could change this to read "C5.0thresh" or something rude to pester our colleagues, but let’s resist the temptation (oh alright, if you must, my_mod$label <- "JimsPasswordIsHisBirthday").

my_mod$label
[1] "C5.0"
tags
Similarly, there is no need to modify the model’s tags element, but if you must:

my_mod$tags <- c(my_mod$tags, "awesome sauce", "custom model best model", "Jims birthday is 8/16/85")
library
The element library provides a character vector of libraries that must be loaded for the modelling code to be run. We aren’t going to add anything that uses a different library, so we don’t need to change this.

type
The type element simply specifies whether the modelling procedure is used for classification, regression, or both. C5.0 is a strictly classification algorithm, and we aren’t about to change that.

parameters
The parameters element lists the model’s tuning parameters (also called hyperparameters) and describes them. The stock C5.0 method parameters are:

my_mod$parameters
  parameter     class                 label
1    trials   numeric # Boosting Iterations
2     model character            Model Type
3    winnow   logical                Winnow
Let’s add a cutoff parameter and a fuzzy parameter. The probability cutoff will be a numeric scalar and fuzzy will be a Boolean indicator of whether or not to use fuzzy thresholding to fit the model.

Note that the parameter trials is of type numeric even though the number of trials will always take on an integer value. Classes assigned here must be one of numeric, character, or logical.

my_mod$parameters <- data.frame(
  parameter = c("trials", "model", "winnow", "fuzzy", "cutoff"),
  class = c("numeric", "character", "logical", "logical", "numeric"),
  label = c("# Boosting Iterations", "Model Type", "Winnow",
            "Fuzzy Thresholding", "Probability Cutoff"))
grid
The grid element is a function that will be used to create a grid of hyperparameters to search over if a grid of parameters is not explicitly passed to train() via the tuneGrid argument. Currently the grid function only produces a grid of parameters for trials, model, and winnow. Let’s update this to create a grid that also includes fuzzy and cutoff. Again I’m starting from the existing code and just adding to it.

The len argument here is passed from the tuneLength argument of train(), which the documentation defines as “an integer denoting the number of levels for each tuning parameter that should be generated by train.”

my_mod$grid <- function(x, y, len = NULL){
  # values of "trials" to try
  c5seq <- if(len == 1) 1 else c(1, 10*((2:min(len, 11)) - 1))
  # values of "cutoff" to try
  cutoff_seq <- seq(.01, .99, length = len)
  # produce the grid
  expand.grid(trials = c5seq,
              model = c("tree", "rules"),
              winnow = c(TRUE, FALSE),
              fuzzy = c(TRUE, FALSE),
              cutoff = cutoff_seq)
}
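As a quick sanity check (x and y are ignored by this grid function, so passing NULL is fine here):

head(my_mod$grid(x = NULL, y = NULL, len = 3))
nrow(my_mod$grid(x = NULL, y = NULL, len = 3))  # 3 trials x 2 models x 2 winnow x 2 fuzzy x 3 cutoffs = 72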
fit
The fit element should be a function for actually fitting the model. This function calls the C5.0() function (from the C50 package) and returns a fitted model. It should pass hyperparameter values from the param argument to the model fitting function, as well as any other arguments the user may have passed to train() through the ... interface that are intended for the fitting function. I will also have this function throw an error if the target variable has more than two classes, since this custom model fitting procedure is only intended for the case where we have two classes.
Note that this fitting function does not reference the cutoff point at all. The cutoff point is estimated independently of the model fitting.
my_mod$fit <- function(x, y, wts, param, lev, last, classProbs, ...){
  # throw an error if there are more than 2 classes in the target variable
  if(length(levels(y)) != 2){
    stop("This only works for 2-class problems!")
  }
  # grab any arguments passed from train through ...
  theDots <- list(...)
  # if one of the arguments is a C5.0Control object,
  if(any(names(theDots) == "control")){
    # change its value of winnow to match the value given in param
    theDots$control$winnow <- param$winnow
    # change its value of fuzzyThreshold to match the value given in param
    theDots$control$fuzzyThreshold <- param$fuzzy
  }else{
    # if a C5.0Control object is not given, create one
    theDots$control <- C5.0Control(winnow = param$winnow,
                                   fuzzyThreshold = param$fuzzy)
  }
  # put all of the model fitting arguments into one list
  argList <- list(x = x, y = y, weights = wts, trials = param$trials,
                  rules = param$model == "rules")
  # append anything left in ...
  argList <- c(argList, theDots)
  # fit a C5.0 model using the arguments in argList
  do.call("C5.0.default", argList)
}
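For reference, fuzzy thresholding is just a switch in C5.0Control(), which is what the fit element toggles; outside of caret a direct call would look something like this (the simulated data and parameter values are purely illustrative):

library(C50)
library(caret)  # only for twoClassSim()
set.seed(1)
d <- twoClassSim(n = 200, intercept = 7)
fuzzyFit <- C5.0(x = d[, names(d) != "Class"], y = d$Class,
                 trials = 10, rules = TRUE,
                 control = C5.0Control(winnow = FALSE, fuzzyThreshold = TRUE))
fuzzyFit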
loop
The loop element is a clever construct that can be used to derive predictions for several models from just one model object. Take a look at the default C5.0 loop function.

my_mod$loop
function(grid) {
  loop <- ddply(grid, c("model", "winnow"),
                function(x) c(trials = max(x$trials)))
  submodels <- vector(mode = "list", length = nrow(loop))
  for(i in seq(along = loop$trials))
  {
    index <- which(grid$model == loop$model[i] &
                   grid$winnow == loop$winnow[i])
    trials <- grid[index, "trials"]
    submodels[[i]] <- data.frame(trials = trials[trials != loop$trials[i]])
  }
  list(loop = loop, submodels = submodels)
}
This takes the tuning parameter grid and picks out some combinations of tuning parameters to fit a model for and others to be “sub-models”. More specifically, for every unique combination of “model” and “winnow”, it picks out the largest value of “trials” and fits that model. All other models with the same “model” and “winnow” values but smaller “trials” values become sub-models of this one. This is because if you fit a C5.0 model with trials = 50 (50 boosting iterations), you can use that model to make predictions for trials = 50 or any number smaller than 50. The C5.0 package allows the user to “strip off” some of the boosting iterations and recover a model with fewer boosting iterations. So instead of fitting a model with 50 boosting iterations, then fitting another model with 40 boosting iterations, and so on, you fit one model with 50 boosting iterations and then use it to generate predictions for all of its sub-models. Note that the model fit with 50 boosting iterations and then stripped back to 40 iterations is exactly the same as the model fit using only 40 boosting iterations. So you save a ton of time by not fitting each model.
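A quick demonstration of this “strip off” behaviour on simulated data (the data and trial counts are arbitrary; if boosting stops early you may see a warning, but the idea is the same):

library(C50)
library(caret)  # only for twoClassSim()
set.seed(1)
d <- twoClassSim(n = 200)
X <- d[, names(d) != "Class"]
m50 <- C5.0(x = X, y = d$Class, trials = 50)
# one fitted object can produce predictions for any smaller number of trials
p50 <- predict(m50, X, type = "prob", trials = 50)
p40 <- predict(m50, X, type = "prob", trials = 40)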
Now here’s the tricky part. The default loop function takes the largest trials value for each model/winnow value pair and makes all smaller trials values sub-models of that model. We want to make sub-models of each of these models containing each cutoff value. Since the caret framework doesn’t allow for sub-sub-models (to the best of my knowledge), we will just make the model with the largest trials and the largest cutoff the “main” model for each model/winnow/fuzzy combination and make all other trials/cutoff combinations sub-models of this “main” model.

my_mod$loop <- function(grid){
  # for each model/winnow/fuzzy combination, choose the model with
  # the largest trials and cutoff to be the "main" model
  loop <- ddply(grid, c("model", "winnow", "fuzzy"),
                function(x) c(trials = max(x$trials), cutoff = max(x$cutoff)))
  # create a list with one element for each "main" model
  submodels <- vector(mode = "list", length = nrow(loop))
  # for each "main" model, make a grid of all sub-models
  for(i in 1:nrow(loop)){
    index <- which(grid$model == loop$model[i] &
                   grid$winnow == loop$winnow[i] &
                   grid$fuzzy == loop$fuzzy[i] &
                   (grid$trials < loop$trials[i] | grid$cutoff < loop$cutoff[i]))
    submodels[[i]] <- grid[index, c('trials', 'cutoff')]
  }
  # return the main models and the sub-models separately
  list(loop = loop, submodels = submodels)
}
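Again, a quick sanity check helps; running the new loop element on a small grid shows the “main” models and their sub-models (plyr is needed for ddply()):

library(plyr)
smallGrid <- my_mod$grid(x = NULL, y = NULL, len = 2)
loopCheck <- my_mod$loop(smallGrid)
loopCheck$loop            # one "main" model per model/winnow/fuzzy combination
loopCheck$submodels[[1]]  # the trials/cutoff combinations covered by the first main model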
predict
The predict element is a function for making class predictions (as opposed to class probabilities) given a model. This also utilizes the sub-models defined by the loop element to make predictions for any sub-models associated with the model object passed to it. There will only be model objects created for the “main” models and never for the sub-models. This is where we save a ton of time. Every sub-model is a model we don’t have to fit. Brilliant!

my_mod$predict <- function(modelFit, newdata, submodels = NULL){
  # generate probability of membership in first class
  class1Prob <- predict(modelFit, newdata, type = 'prob')[, modelFit$obsLevels[1]]
  # now assign observations to a class using the cutoff value provided
  out <- ifelse(class1Prob >= modelFit$tuneValue$cutoff,
                modelFit$obsLevels[1],
                modelFit$obsLevels[2])
  # now make predictions for any sub-models
  if(!is.null(submodels)){
    tmp <- out # hold this for a second plz
    # create a list with an element for each sub-model plus one for the main model
    out <- vector(mode = "list", length = nrow(submodels) + 1)
    out[[1]] <- tmp # the first element is the main model
    for(j in 1:nrow(submodels)){
      class1Prob_again <- predict(modelFit, newdata, type = 'prob',
                                  trials = submodels$trials[j])[, modelFit$obsLevels[1]]
      out[[j+1]] <- ifelse(class1Prob_again >= submodels$cutoff[j],
                           modelFit$obsLevels[1],
                           modelFit$obsLevels[2])
    }
  }
  out
}
prob
The prob element is a lot like the predict element, only it returns class probabilities instead of event/non-event predictions. These probabilities are independent of the cutoff point, so we don’t need to change anything in prob to accommodate that. It also takes a model object as an input, which contains the information about the added fuzzy parameter, so we don’t need to change anything to accommodate that either. So we can leave prob as is.
We also don’t need to change sort, levels, predictors, or varImp. If you’re looking for an example of modifying those, I’ll direct you to the caret page on customizing model tuning.
Now let’s take this model for a test drive. I’ll generate some data with a rare class and a prevalent class, using caret’s twoClassSim() function.

set.seed(442)
trainingSet <- twoClassSim(n = 500, intercept = 7)
testingSet <- twoClassSim(n = 500, intercept = 7)
Class1 is the rare class and Class2 is the prevalent class.

table(trainingSet$Class)
Class1 Class2
    48    452
I’ll need to define a performance measure to decide which combination of hyperparameters is best. I’ll go with the performance measure mentioned earlier: the euclidean distance to the perfect model with sensitivity = 1 and specificity = 1. (This is taken from the caret website.)

fourStats <- function(data, lev = levels(data$obs), model = NULL){
  out <- c(twoClassSummary(data, lev = levels(data$obs), model = NULL))
  # The best possible model has sensitivity of 1 and specificity of 1.
  # How far are we from that value?
  coords <- matrix(c(1, 1, out["Spec"], out["Sens"]),
                   ncol = 2, byrow = TRUE)
  colnames(coords) <- c("Spec", "Sens")
  rownames(coords) <- c("Best", "Current")
  c(out, Dist = dist(coords)[1])
}
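To see what this summary function returns, you can feed it a small hand-made resample data frame (the values below are made up; note that the class-probability columns must be named after the class levels, as caret arranges internally):

toy <- data.frame(obs  = factor(c("Class1", "Class2", "Class2", "Class1", "Class2"),
                                levels = c("Class1", "Class2")),
                  pred = factor(c("Class1", "Class2", "Class1", "Class2", "Class2"),
                                levels = c("Class1", "Class2")),
                  Class1 = c(0.80, 0.10, 0.60, 0.40, 0.20))
toy$Class2 <- 1 - toy$Class1
fourStats(toy)  # returns ROC, Sens, Spec, and Dist (distance to the perfect model)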
Next I’ll define a grid of hyperparameters to search over. I set winnow = FALSE because winnowing is best for situations with small n and large p. In my experience a model with winnowing is never better than one without if you have n > p. I also limit trials to a maximum of 40, just for the sake of speed. Since we know that the prob element of the custom model returns the probabilities of membership in the first class and we know that Class1 is the rare class, I expect the optimal cutoff point will be somewhere below the default value of 0.5. Since it costs almost nothing to evaluate multiple cutoff points, I consider 21 different possible cutoff point values.

mygrid <- expand.grid(trials = c(1, 1:4*10),
                      model = c('rules', 'tree'),
                      winnow = FALSE,
                      fuzzy = c(TRUE, FALSE),
                      cutoff = c(0.01, seq(0.025, 0.5, by = 0.025)))
Next I define the resampling scheme for my model fitting. 10-fold cross-validation repeated 5 times seems pretty thorough. I also supply the summary function fourStats and set classProbs = TRUE so that fourStats() can compute the model ROC.

mycontrol <- trainControl(method = "repeatedcv",
                          number = 10,
                          repeats = 5,
                          classProbs = TRUE,
                          summaryFunction = fourStats)
Now let’s finally put the custom model to work.
library(plyr)
library(C50)
set.seed(949)
mod1 <- train(Class ~ ., data = trainingSet,
              method = my_mod,
              ## Minimize the distance to the perfect model
              metric = "Dist",
              maximize = FALSE,
              tuneGrid = mygrid,
              trControl = mycontrol)
The fact that this ran without an error is encouraging. It’s always prudent to inspect the results to make sure they make sense.
ggplot.train(mod1)
As expected, cutoff points below 0.5 perform best and models with more boosting iterations perform better. What were the optimal model hyperparameters? How good was the model?
best_ind <- best(x=mod1$results, metric='Dist', maximize=FALSE)
mod1$results[best_ind, ]
   model winnow fuzzy trials cutoff       ROC  Sens      Spec      Dist
92 rules  FALSE FALSE     40  0.175 0.9330292 0.885 0.8548696 0.2135743
        ROCSD    SensSD     SpecSD    DistSD
92 0.06782202 0.1433029 0.04790972 0.1061781
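Since predict.train() routes through our custom predict element (and therefore uses the tuned cutoff), a reasonable next check is to score the held-out testingSet generated earlier:

testPred <- predict(mod1, newdata = testingSet)
confusionMatrix(data = factor(testPred, levels = levels(testingSet$Class)),
                reference = testingSet$Class,
                positive = "Class1")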
I plan on writing a follow up post to this one in which I will compare this model to several other methods of classifying unbalanced data. But for now, this code works. Whether or not it’s useful remains to be seen.