Purpose

The caret package includes a function for data splitting, createTimeSlices(), that creates data partitions using a fixed or growing window. The main arguments to this function, initialWindow and horizon, allow the user to create training/validation resamples of contiguous observations, with the validation set always containing n = horizon rows. If fixedWindow = TRUE, the training set always has n = initialWindow rows. This works well for regular time series, but what if your observations aren’t recorded at regular intervals? How can you divide your data into training/validation sets that span fixed time intervals instead of a fixed number of rows?
fixedWindow and horizon Illustrated
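For reference, here is the kind of split createTimeSlices() produces on a regular series (an illustrative call, not from the original post):

library(caret)

# 10 regular observations: train on 5 contiguous rows, validate on the next 2
slices <- createTimeSlices(1:10, initialWindow = 5, horizon = 2, fixedWindow = TRUE)
slices$train[[1]]  # rows 1-5
slices$test[[1]]   # rows 6-7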
Allow me to present a solution:
createIrregularTimeSlices <- function(y, initialWindow, horizon = 1,
                                      unit = c("sec", "min", "hour", "day", "week", "month", "year", "quarter"),
                                      fixedWindow = TRUE, skip = 0) {
  unit <- match.arg(unit)
  if (inherits(y, 'Date')) y <- as.POSIXct(y)
  stopifnot(inherits(y, 'POSIXt'))
  
  # generate the sequence of date/time values over which to split. These will always be in ascending order, with no missing date/times.
  yvals <- seq(from = lubridate::floor_date(min(y), unit), 
               to = lubridate::ceiling_date(max(y), unit), 
               by = unit)
  
  # determine the start and stop date/times for each time slice
  stops <- seq_along(yvals)[initialWindow:(length(yvals) - horizon)]
  if (fixedWindow) {
    starts <- stops - initialWindow + 1
  } else {
    starts <- rep(1, length(stops))
  }
  
  # function that returns the indices of y that are between the start and stop date/time for a slice 
  ind <- function(start, stop, y, yvals) {
    which(y > yvals[start] & y <= yvals[stop])
  }
  train <- mapply(ind, start = starts, stop = stops, MoreArgs = list(y = y, yvals = yvals), SIMPLIFY = FALSE)
  test <- mapply(ind, start = stops, stop = (stops + horizon), MoreArgs = list(y = y, yvals = yvals), SIMPLIFY = FALSE)
  names(train) <- paste("Training", gsub(" ", "0", format(seq(along = train))), sep = "")
  names(test) <- paste("Testing", gsub(" ", "0", format(seq(along = test))), sep = "")
  
  # reduce the number of slices returned if skip > 0
  if (skip > 0) {
    thin <- function(x, skip = 2) {
      n <- length(x)
      x[seq(1, n, by = skip)]
    }
    train <- thin(train, skip = skip + 1)
    test <- thin(test, skip = skip + 1)
  }
  
  # eliminate any slices that have no observations in either the training set or the validation set
  empty <- c(which(sapply(train, function(x) length(x) == 0)),
             which(sapply(test, function(x) length(x) == 0)))
  if (length(empty) > 0) {
    train <- train[-empty]
    test <- test[-empty]
  }
  
  out <- list(train = train, test = test)
  out
}
Some features to note:
  • It doesn’t matter what order y is in when passed to the function.
  • It doesn’t matter if there are unrepresented time periods in y. The function groups data by unit, using all units in range(y), whether or not there is an observation within each unit.
  • If units without any observations result in a partition with an empty training set or an empty validation set, that partition is not returned.
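As a quick sanity check, here is a toy illustration (not from the original post; it assumes the lubridate package is installed, since the function calls lubridate::floor_date() and lubridate::ceiling_date()):

set.seed(1)
# 50 observations at random, uneven times spread over roughly 30 days
y <- as.POSIXct("2016-01-01") + sort(runif(50, min = 0, max = 30 * 24 * 60 * 60))
slices <- createIrregularTimeSlices(y, initialWindow = 7, horizon = 2, unit = "day")
sapply(slices$train, length)  # training set sizes vary with how many observations fall in each window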

Example

For starters, we need a data set with a date/time variable. Let’s use the economics data included in ggplot2.
library(ggplot2)
data(economics)
Next, let’s use createIrregularTimeSlices() to create data partitions. I’ll use a fixed window of 20 quarters for training data, to be validated on the following 4 quarters. There are 170 possible 20-quarter/4-quarter training/validation sets in the data. To reduce the number of training/validation combinations, I use the skip argument to drop four resamples between each one kept (keeping every fifth), reducing the number of resamples and thus the training time.
my_partitions <- createIrregularTimeSlices(economics$date, initialWindow = 20, horizon = 4,
                                           unit = "quarter", fixedWindow = TRUE, skip = 4)
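To see how many resamples remain after thinning:

length(my_partitions$train)  # 34, matching the rows of the sample-size table at the end of the post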
Finally, let’s use the partitions to train a model.
library(caret)
library(mgcv)
library(nlme)
ctrl <- trainControl(index = my_partitions$train, indexOut = my_partitions$test)
mod <- train(psavert ~ pce + pop + uempmed + unemploy, data = economics, method = 'gam', trControl = ctrl)
Note that caret calculates the average performance across resamples. createIrregularTimeSlices() can produce resamples with validation sets of varying size, so you may want to take a weighted average of the calculated performance values, weighted by the sample size in the validation set (a sketch of this appears after the sample-size table below).
The indices created with createIrregularTimeSlices() are stored within the caret model object, so you can inspect them later to retrieve the training/validation sample sizes.
training_sample_size <- sapply(mod$control$index, length)
validation_sample_size <- sapply(mod$control$indexOut, length)
cbind(training_sample_size, validation_sample_size)
##             training_sample_size validation_sample_size
## Training001                   55                     12
## Training006                   57                     12
## Training011                   57                     12
## Training016                   57                     12
## Training021                   57                     12
## Training026                   57                     12
## Training031                   57                     12
## Training036                   57                     12
## Training041                   57                     12
## Training046                   57                     12
## Training051                   57                     12
## Training056                   57                     12
## Training061                   57                     12
## Training066                   57                     12
## Training071                   57                     12
## Training076                   57                     12
## Training081                   57                     12
## Training086                   57                     12
## Training091                   57                     12
## Training096                   57                     12
## Training101                   57                     12
## Training106                   57                     12
## Training111                   57                     12
## Training116                   57                     12
## Training121                   57                     12
## Training126                   57                     12
## Training131                   57                     12
## Training136                   57                     12
## Training141                   57                     12
## Training146                   57                     12
## Training151                   57                     12
## Training156                   57                     12
## Training161                   57                     12
## Training166                   57                     12
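As mentioned above, here is a minimal sketch of a validation-size-weighted performance summary. It assumes the default returnResamp = "final" behavior, so that mod$resample holds one row of hold-out performance per resample, that those rows are labeled with the names of the index list (Training001, Training006, ...), and that RMSE is the metric of interest:

# Map each resample name to its validation set size, then weight the RMSE values
wts <- setNames(sapply(mod$control$indexOut, length), names(mod$control$index))
weighted.mean(mod$resample$RMSE, w = wts[mod$resample$Resample])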