Introduction

In this post I share a custom model tuning procedure for optimizing the probability threshold for class imbalanced data. This is done within the excellent caret package framework and is akin to the example on the package website, but that example shows an extension of the random forest (or rf) method while I present an extension to the C5.0 method. The post is divided into two parts, Motivation and Code. If you’re savvy to the topic of class imbalanced data and you just want to see the code already, skip ahead to the Code section.

Motivation

Recently I was working on a 2-class classification problem where my data suffered from a fairly severe class imbalance (I had a lot more “non-events” than “events”). As far as I know, there are three ways to deal with this:
  1. Use an imbalanced cost matrix for your model.
  2. Over-sample “events”, under-sample “non-events”, or some combination of the two.
  3. Fit a classification model as you normally would, use it to generate the probability of an instance being in the “event” class, and then play with your cutoff point - that is, experiment with raising or lowering the probability of being an “event” at which you draw the line between “event” and “non-event”.
The first approach is a good one. Using an imbalanced cost matrix makes different errors “cost” different amounts. In my case, with “events” being rare, it makes sense that calling an event a non-event (missing a rare event) should be more expensive than calling a non-event an event (a false positive). Imposing costs this way actually causes your model parameters to adjust accordingly, whereas option 3 above does not. The major problem with this approach is that not all modelling techniques support non-uniform cost matrices. As far as I know, within the caret framework only three model types support cost-sensitive model fitting: C5.0, CART, and SVMs.
The second approach (over/under sampling) is perfectly fine if you have the data to do it. However, in my case, I had 15% events and 85% non-events. To balance the two classes via under-sampling I would have to keep only a random subset of the non-events equal in size to the events, throwing away roughly 70% of my data. I didn’t have enough data to feel confident doing this. On the other hand, to balance my data via over-sampling the events, I would have to include each observation that was an event 5 to 6 times. Having this many duplicate data points would surely bias my model. There are ways to randomly “fuzz” your over-sampled observations so you’re not just including a bunch of duplicate observations, but for reasons I won’t get into, I didn’t see this as a viable option with my data. Of course you could compromise and mix over-sampling with under-sampling, but I decided against this too. There are genuinely clever ways of doing so, though, and one should not write it off as a bad or sub-optimal strategy for dealing with class imbalanced data.
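(If you do want to go the sampling route, caret ships with helpers for it. Here’s a minimal sketch, assuming a data frame dat holding your predictors plus a two-level factor column Class, both hypothetical names.)
library(caret)

# keep all events and an equally sized random subset of non-events
balanced_down <- downSample(x = dat[, setdiff(names(dat), "Class")],
                            y = dat$Class, yname = "Class")

# or duplicate events at random until the classes are balanced
balanced_up <- upSample(x = dat[, setdiff(names(dat), "Class")],
                        y = dat$Class, yname = "Class")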
The third approach is a commonly used one. Typically the strategy looks like this:
  • Partition your data into a training/validation set and a test set (the common wisdom is 90% training/validation, 10% test).
  • Estimate model hyperparameters (also called tuning parameters) using some sort of cross-validation with your training/validation data.
  • With hyperparameters picked out, fit a model to the entire training/validation set (without leaving out a fold for cross-validation purposes).
  • Generate class probabilities for the test set using the model from the previous step.
  • Using the predicted class probabilities and known classes of the test data, find the cutoff point that yields the “best” class predictions, defining “best” however you like. Note that common measures of model performance such as area under the ROC curve are independent of the chosen cutoff point, so such measures are not appropriate here. A good measure I’ve seen is the euclidean distance between your model’s sensitivity/specificity and the ideal sensitivity/specificity of 1/1, that is, Performance = sqrt((Sensitivity - 1)^2 + (Specificity - 1)^2) (a short sketch of this search follows the list).
  • Re-fit the model with the chosen hyperparameters, this time on the entire data set.
  • Use this model to generate class probabilities for new observations and the chosen cutoff point to score the new data as an event or non-event.
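To make those last few steps concrete, here’s a minimal sketch of the cutoff search. It assumes you already have a vector test_probs of predicted event probabilities for the test set and a factor test_truth of the true test set classes (both hypothetical names).
# candidate cutoff points to evaluate
cutoffs <- seq(0.05, 0.95, by = 0.05)

dist_to_perfect <- sapply(cutoffs, function(ct){
  pred <- ifelse(test_probs >= ct, "event", "non_event")
  sens <- mean(pred[test_truth == "event"] == "event")          # sensitivity
  spec <- mean(pred[test_truth == "non_event"] == "non_event")  # specificity
  # euclidean distance to the perfect model (sensitivity = 1, specificity = 1)
  sqrt((sens - 1)^2 + (spec - 1)^2)
})

# the "best" cutoff is the one closest to the perfect model
cutoffs[which.min(dist_to_perfect)]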
There are two major drawbacks to this method. The first is that it “costs” you data when searching for optimal hyperparameters - you’re only using 90% of what’s available. This isn’t a big deal if you’ve got plenty of data. The second drawback is that the cutoff point you choose is biased towards a small portion of your data. You only used 10% of your data (with no cross-validation or anything!) to estimate the “best” cutoff point. There is the additional criticism that you trained your model on unbalanced data, which produces a model that is very good at identifying the prevalent class, but poor at identifying the rare class. A very good blog post I read recently suggests that while this criticism is not unfounded, a model fit to unbalanced data does not under-perform by very much. Choosing a cutoff probability other than 0.5 also corrects the “unbalanced” nature of your model.
The caret package provides an excellent framework for model fitting/validation, serving as an API of sorts to many different R packages with different modelling procedures. It also allows you to define your own modelling procedure if you wish. There are several examples of this on the caret package website and one of these examples demonstrates adding the probability cutoff point discussed above to the list of a model’s hyperparameters. This is an interesting technique that solves the problem of data “cost” and bias in the selection of a cutoff probability other than the default 0.5. The resulting procedure follows this algorithm:
define the values of hyperparameters over which you would like to grid search
define the values of cutoff point over which you would like to search
for each combination of hyperparameters H_i {
  for each cross-validation fold F_j {
    fit a model M_{i,j} using hyperparameters H_i and all data except F_j
    use model M_{i,j} to predict class probabilities for hold out set F_j
    for each value of cutoff point C_k {
      generate class predictions for hold out set F_j from class probabilities using C_k as the threshold
      evaluate the performance of these predictions
    }
  }
}
Assuming you are using ten-fold cross validation repeated five times, following this algorithm will yield 50 estimates of model performance for every hyperparameter/threshold pair, without costing any extra data to estimate the threshold value. From these 50 performance estimates you can calculate a mean and standard error and choose the hyperparameter/threshold pair that yields the best model performance just as you would choose hyperparameters for any model. Not only does this not cost any extra data, but your choice of cutoff point is much less biased than had you estimated it using just one hold out set.
A reflexive argument against this is “but now I’ve massively increased the size of my grid search!” This is true: if you want to search over 10 values of hyperparameter A, 10 values of hyperparameter B, and 10 values of cutoff point, you now have 10×10×10=1000 parameter combinations to search over instead of the 10×10=100 combinations you would have if you only grid searched over the hyperparameters. However, you don’t need to fit a separate model to evaluate each cutoff value. You only need to fit a model for each combination of hyperparameters and cross-validation folds, generate class probabilities (once), and then score the data as “event” or “non-event” using each cutoff probability. So while the grid you must search over expands, the added time cost is just the time it takes to generate some class probabilities and execute a bunch of if-else logic. This is negligible in comparison to the time it takes to fit all those models.

Code

The example of creating a custom modelling procedure on the caret package website adds this threshold optimizing functionality to the random forest method. I used this example to add the functionality to the C5.0 method. For kicks, I will also add the choice to use/not use fuzzy thresholding as a Boolean model hyperparameter.
To start, let’s grab the code behind the default C5.0 method and inspect it.
library(caret)
my_mod <- getModelInfo("C5.0", regex = FALSE)[[1]]
names(my_mod)
 [1] "label"      "library"    "loop"       "type"       "parameters"
 [6] "grid"       "fit"        "predict"    "prob"       "predictors"
[11] "varImp"     "tags"       "sort"      
So there are 13 elements in this list that define exactly what train() does when you pass it the argument method="C5.0". We will need to edit some of these, but not all of them. It’s definitely easier to start from the default C5.0 method code and make changes as needed than to start with an empty list and define each list element from scratch.

label

The element label just provides the name of the algorithm used. I suppose we could change this to read "C5.0thresh" or something rude to pester our colleagues, but lets resist the temptation (oh alright, if you must, my_mod$label <- "JimsPasswordIsHisBirthday").
my_mod$label
[1] "C5.0"

tags

Similarly, there is no need to modify the model’s tags element, but if you must, my_mod$tags <- c(my_mod$tags, "awesome sauce", "custom model best model", "Jims birthday is 8/16/85")

library

The element library provides a character vector of libraries that must be loaded for the modelling code to be run. We aren’t going to add anything that uses a different library, so we don’t need to change this.

type

The type element simply specifies whether the modelling procedure is used for classification, regression, or both. C5.0 is strictly a classification algorithm, and we aren’t about to change that.

parameters

The parameters element lists the model’s tuning parameters (also called hyperparameters) and describes them. The stock C5.0 method parameters are:
my_mod$parameters
  parameter     class                 label
1    trials   numeric # Boosting Iterations
2     model character            Model Type
3    winnow   logical                Winnow
Let’s add a cutoff parameter and a fuzzy parameter. The probability cutoff will be a numeric scalar and fuzzy should be a Boolean indicator of whether or not to use fuzzy thresholding to fit the model.
Note that the parameter trials is of type numeric even though the number of trials will always take on an integer value. Classes assigned here must be one of numeric, character, or logical.
my_mod$parameters <- data.frame(
  parameter = c("trials", "model", "winnow", "fuzzy", "cutoff"),
  class = c("numeric", "character", "logical", "logical", "numeric"),
  label = c("# Boosting Iterations", "Model Type", "Winnow", "Fuzzy Thresholding", "Probability Cutoff"))

grid

The grid element is a function that will be used to create a grid of hyperparameters to search over if a grid of parameters is not explicitly passed to train() via the tuneGrid argument. Currently the grid function only produces a grid of parameters for trials, model, and winnow. Let’s update this to create a grid that also includes fuzzy and cutoff. Again I’m starting from the existing code and just adding.
The len argument here is passed from the tuneLength argument of train(), which the documentation defines as “an integer denoting the number of levels for each tuning parameter that should be generated by train.”
my_mod$grid <- function(x, y, len = NULL){
  # values of "trials" to try
  c5seq <- if(len==1)  1 else  c(1, 10*((2:min(len, 11)) - 1))
  # values of "cutoff" to try
  cutoff_seq <- seq(.01, .99, length=len)
  
  # produce the grid
  expand.grid(trials = c5seq, 
              model = c("tree", "rules"), 
              winnow = c(TRUE, FALSE), 
              fuzzy = c(TRUE, FALSE), 
              cutoff = cutoff_seq)
}
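As a quick check, you can call the new grid function directly. The x and y arguments aren’t used in the body, so NULL placeholders are fine for a peek at the default search grid.
# with len = 2 this produces 2 x 2 x 2 x 2 x 2 = 32 parameter combinations
head(my_mod$grid(x = NULL, y = NULL, len = 2))
nrow(my_mod$grid(x = NULL, y = NULL, len = 2))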

fit

The fit element should be a function for actually fitting the model. This function calls the C5.0() function (from the C50 package) and returns a fitted model. It should pass hyperparameter values from the param argument to the model fitting function, as well as any other arguments the user may have passed to train() through the ... interface that are intended for the fitting function. I will also have this function throw an error if the target variable has more than two classes since this custom model fitting procedure is only intended for the case where we have two classes.
Note that this fitting function does not reference the cutoff point at all. The cutoff point is estimated independently of the model fitting.
my_mod$fit <- function(x, y, wts, param, lev, last, classProbs, ...){ 
  
  # throw an error if there are more than 2 classes in the target variable
  if(length(levels(y)) != 2){
    stop("This only works for 2-class problems!")
  }
  
  # grab any arguments passed from train through ...
  theDots <- list(...)
  
  # if one of the arguments is a C5.0Control object,
  if(any(names(theDots) == "control")){ 
    # change its value of winnow to match the value given in param
    theDots$control$winnow <- param$winnow
    # change its value of fuzzyThreshold to match the value given in param
    theDots$control$fuzzyThreshold <- param$fuzzy
  }else{
  # if a C5.0Control object is not given, create one
    theDots$control <- C5.0Control(winnow = param$winnow,
                                   fuzzyThreshold = param$fuzzy)
  }
  
  # put all of the model fitting arguments into one list
  argList <- list(x = x, y = y, weights = wts, trials = param$trials,
                  rules = param$model == "rules")
  # append anything left in ...
  argList <- c(argList, theDots)
  
  # fit a C5.0 model using the arguments in argList
  # (C50::: is used so C5.0.default is found even if it isn't exported)
  do.call(C50:::C5.0.default, argList)
}

loop

The loop element is a clever construct that can be used to derive predictions for several models from just one model object. Take a look at the default C5.0 loop function.
my_mod$loop
function(grid) {     
                    loop <- ddply(grid, c("model", "winnow"),
                                  function(x) c(trials = max(x$trials)))                 
                    
                    submodels <- vector(mode = "list", length = nrow(loop))
                    for(i in seq(along = loop$trials))
                    {
                      index <- which(grid$model == loop$model[i] & 
                                       grid$winnow == loop$winnow[i])
                      trials <- grid[index, "trials"] 
                      submodels[[i]] <- data.frame(trials = trials[trials != loop$trials[i]])
                    }     
                    list(loop = loop, submodels = submodels)
                  }
This takes the tuning parameter grid and picks out some combinations of tuning parameters to fit a model for and others to be “sub-models”. More specifically, for every unique combination of “model” and “winnow”, it picks out the largest value of “trials” and fits that model. All other models with the same “model” and “winnow” values but smaller “trials” values become sub-models of this one. This is because if you fit a C5.0 model with trials = 50 (50 boosting iterations), you can use that model to make predictions for trials = 50 or any number smaller than 50. The C50 package allows the user to “strip off” some of the boosting iterations and recover a model with fewer boosting iterations. So instead of fitting a model with 50 boosting iterations, then fitting another model with 40 boosting iterations, and so on, you fit one model with 50 boosting iterations and then use it to generate predictions for all of its sub-models. Note that the model fit with 50 boosting iterations and then stripped back to 40 iterations is exactly the same as the model fit using only 40 boosting iterations. So you save a ton of time by not fitting each model.
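If you want to convince yourself of this, here’s a quick sketch, where X is a hypothetical data frame of predictors and y a hypothetical two-level factor of classes.
library(C50)

big_mod   <- C5.0(x = X, y = y, trials = 50)  # one model with 50 boosting iterations
small_mod <- C5.0(x = X, y = y, trials = 40)  # a separate model with only 40

# predictions from the 50-iteration model, stripped back to 40 iterations,
# should match the 40-iteration model (assuming boosting didn't stop early)
identical(predict(big_mod, X, trials = 40),
          predict(small_mod, X))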
Now here’s the tricky part. The default loop function takes the largest trials value for each model/winnow value pair and makes all smaller trials values sub-models of that model. We want to make sub-models of each of these models containing each cutoff value. Since the caret framework doesn’t allow for sub-sub-models (to the best of my knowledge), we will just make the model with the largest trials and the largest cutoff the “main” model for each model/winnow/fuzzy combination and make all trials/cutoff combinations sub-models of this “main” model.
my_mod$loop <- function(grid){
  
  # for each model/winnow/fuzzy combination, choose the model with
  # the largest trials and cutoff to be the "main" model
  loop <- ddply(grid, c("model", "winnow", "fuzzy"),
                function(x) c(trials = max(x$trials), cutoff = max(x$cutoff)))
  
  # create a list with one element for each "main" model
  submodels <- vector(mode = "list", length = nrow(loop))
  
  # for each "main" model, make a grid of all sub-models
  for(i in 1:nrow(loop)){
    index <- which(grid$model == loop$model[i] &
                   grid$winnow == loop$winnow[i] &
                   grid$fuzzy == loop$fuzzy[i] & 
                   (grid$trials < loop$trials[i] | grid$cutoff < loop$cutoff[i]))
    
    submodels[[i]] <- grid[index, c('trials', 'cutoff')]
  }
  
  # return the main models and the sub-models separately
  list(loop = loop, submodels = submodels)
}
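To see how a grid gets split into “main” models and sub-models, you can run the new loop function on a toy grid by hand (plyr needs to be loaded for ddply()).
library(plyr)

tiny_grid <- expand.grid(trials = c(1, 10),
                         model  = "rules",
                         winnow = FALSE,
                         fuzzy  = FALSE,
                         cutoff = c(0.2, 0.5))

split_up <- my_mod$loop(tiny_grid)
split_up$loop       # the single "main" model: trials = 10, cutoff = 0.5
split_up$submodels  # the other three trials/cutoff combinations become its sub-models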

predict

The predict element is a function for making class predictions (as opposed to class probabilities) given a model. This also utilizes the sub-models defined by the loop element to make predictions for any sub-models associated with the model object passed to it. There will only be model objects created for the “main” models and never for the sub-models. This is where we save a ton of time. Every sub-model is a model we don’t have to fit. Brilliant!
my_mod$predict <- function(modelFit, newdata, submodels = NULL){
  
  # generate probability of membership in first class
  class1Prob <- predict(modelFit, newdata, type='prob')[, modelFit$obsLevels[1]]
  
  # now assign observations to a class using the cutoff value provided
  out <- ifelse(class1Prob >= modelFit$tuneValue$cutoff,
                modelFit$obsLevels[1],
                modelFit$obsLevels[2])
  
  # now make predictions for any sub-models
  if(!is.null(submodels)){
    tmp <- out # hold this for a second plz
    # create a list with an element for each sub-model plus one for the main model
    out <- vector(mode = "list", length = nrow(submodels) + 1)
    out[[1]] <- tmp # the first element is the main model
    
    for(j in 1:nrow(submodels)){
      class1Prob_again <- predict(modelFit, newdata, type='prob',
                                  trials=submodels$trials[j])[, modelFit$obsLevels[1]]
      out[[j+1]] <- ifelse(class1Prob_again >= submodels$cutoff[j],
                           modelFit$obsLevels[1],
                           modelFit$obsLevels[2])
    }
  }
  out
}

prob

The prob element is a lot like the predict element, only it returns class probabilities instead of event/non-event predictions. These probabilities are independent of the cutoff point, so we don’t need to change anything in prob to accommodate that. It also takes a model object as an input, which contains the information about the added fuzzy parameter, so we don’t need to change anything to accommodate that either. So we can leave prob as is.
We also don’t need to change sort, levels, predictors, or varImp. If you’re looking for an example of modifying those, I’ll direct you to the caret page on customizing model tuning.
Now let’s take this model for a test drive. I’ll generate some data with a rare class and a prevalent class, using caret’s twoClassSim() function.
set.seed(442)
trainingSet <- twoClassSim(n = 500, intercept = 7)
testingSet  <- twoClassSim(n = 500, intercept = 7)
Class1 is the rare class and Class2 is the prevalent class.
table(trainingSet$Class)

Class1 Class2 
    48    452 
I’ll need to define a performance measure to decide which combination of hyperparameters is best. I’ll go with the performance measure mentioned earlier - the euclidean distance to the perfect model with sensitivity = 1 and specificity = 1. (This is taken from the caret website.)
fourStats <- function(data, lev = levels(data$obs), model = NULL){

  out <- c(twoClassSummary(data, lev = levels(data$obs), model = NULL))

  # The best possible model has sensitivity of 1 and specificity of 1. 
  # How far are we from that value?
  coords <- matrix(c(1, 1, out["Spec"], out["Sens"]), ncol = 2, byrow = TRUE)
  colnames(coords) <- c("Spec", "Sens")
  rownames(coords) <- c("Best", "Current")
  c(out, Dist = dist(coords)[1])
}
Next I’ll define a grid of hyperparameters to search over. I set winnow = FALSE because winnowing is best for situations with small n and large p; in my experience a model with winnowing is never better than one without if you have n > p. I also limit trials to a maximum of 40, just for the sake of speed. Since we know that the prob element of the custom model returns the probabilities of membership in the first class and we know that Class1 is the rare class, I expect the optimal cutoff point will be somewhere below the default value of 0.5. Since it costs almost nothing to evaluate multiple cutoff points, I consider 21 different possible cutoff point values.
mygrid <- expand.grid(trials=c(1, 1:4*10), 
                      model=c('rules', 'tree'), 
                      winnow=FALSE, 
                      fuzzy=c(TRUE, FALSE), 
                      cutoff=c(0.01, seq(0.025, 0.5, by=0.025)))
Next I define the resampling scheme for my model fitting: 10-fold cross-validation repeated 5 times seems pretty thorough. I also supply the summary function fourStats and set classProbs = TRUE so that fourStats() can compute the model ROC.
mycontrol <- trainControl(method = "repeatedcv",
                          number = 10,
                          repeats = 5,
                          classProbs = TRUE,
                          summaryFunction = fourStats)
Now let’s finally put the custom model to work.
library(plyr)
library(C50)

set.seed(949)
mod1 <- train(Class ~ ., data = trainingSet,
              method = my_mod,
              ## Minimize the distance to the perfect model
              metric = "Dist",
              maximize = FALSE,
              tuneGrid=mygrid,
              trControl = mycontrol)
The fact that this ran without an error is encouraging. It’s always prudent to inspect the results to make sure they make sense.
ggplot(mod1)
As expected, cutoff points below 0.5 perform best and models with more boosting iterations perform better. What were the optimal model hyperparameters? How good was the model?
best_ind <- best(x=mod1$results, metric='Dist', maximize=FALSE)
mod1$results[best_ind, ]
   model winnow fuzzy trials cutoff       ROC  Sens      Spec      Dist
92 rules  FALSE FALSE     40  0.175 0.9330292 0.885 0.8548696 0.2135743
        ROCSD    SensSD     SpecSD    DistSD
92 0.06782202 0.1433029 0.04790972 0.1061781
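Because the custom predict element applies the cutoff stored in the final model’s tuneValue, predictions from the trained object use the tuned threshold automatically. Here’s a minimal sketch of scoring the held-out testingSet.
test_preds <- predict(mod1, newdata = testingSet)
confusionMatrix(data = factor(test_preds, levels = levels(testingSet$Class)),
                reference = testingSet$Class)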
I plan on writing a follow up post to this one in which I will compare this model to several other methods of classifying unbalanced data. But for now, this code works. Whether or not it’s useful remains to be seen.