Purpose
The caret package includes a function for data splitting, createTimeSlices(), that creates data partitions using a fixed or growing window. The main arguments to this function, initialWindow and horizon, allow the user to create training/validation resamples consisting of contiguous observations, with the validation set always consisting of n = horizon rows. If fixedWindow = TRUE, the training set always has n = initialWindow rows. This works well for regular time series, but what if your observations aren’t recorded at regular intervals? How can you divide your data into training/validation sets that span fixed time intervals instead of a fixed number of rows?
fixedWindow and horizon Illustrated
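As a small illustration of the row-based behaviour described above, here is a sketch that calls caret's createTimeSlices() on a toy series of 10 rows. The toy series and the object names (y, fixed, growing) are assumptions made up for this example.

```r
library(caret)

# Hypothetical regular series: 10 consecutive observations
y <- 1:10

# Fixed window: every training set has initialWindow = 5 rows
fixed <- createTimeSlices(y, initialWindow = 5, horizon = 2, fixedWindow = TRUE)

# Growing window: every training set starts at row 1 and keeps growing
growing <- createTimeSlices(y, initialWindow = 5, horizon = 2, fixedWindow = FALSE)

lengths(fixed$train)    # training set sizes with a fixed window
lengths(growing$train)  # training set sizes with a growing window
lengths(fixed$test)     # validation set sizes (always horizon rows)
```

With a fixed window every training set holds 5 rows, with a growing window the training sets grow from 5 to 8 rows, and in both cases each validation set holds 2 rows.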
Allow me to present a solution:
createIrregularTimeSlices <- function(y, initialWindow, horizon = 1,
                                      unit = c("sec", "min", "hour", "day", "week", "month", "year", "quarter"),
                                      fixedWindow = TRUE, skip = 0) {
  # resolve unit to a single value; without this, the default is the whole vector of choices
  unit <- match.arg(unit)
  if (inherits(y, "Date")) y <- as.POSIXct(y)
  stopifnot(inherits(y, "POSIXt"))

  # generate the sequence of date/time values over which to split. These will always be in ascending order, with no missing date/times.
  yvals <- seq(from = lubridate::floor_date(min(y), unit),
               to = lubridate::ceiling_date(max(y), unit),
               by = unit)

  # determine the start and stop date/times for each time slice
  stops <- seq_along(yvals)[initialWindow:(length(yvals) - horizon)]
  if (fixedWindow) {
    starts <- stops - initialWindow + 1
  } else {
    starts <- rep(1, length(stops))
  }

  # function that returns the indices of y that are between the start and stop date/time for a slice.
  # The interval is (yvals[start], yvals[stop]], so a fixed training window spans initialWindow - 1 units.
  ind <- function(start, stop, y, yvals) {
    which(y > yvals[start] & y <= yvals[stop])
  }
  train <- mapply(ind, start = starts, stop = stops, MoreArgs = list(y = y, yvals = yvals), SIMPLIFY = FALSE)
  test <- mapply(ind, start = stops, stop = (stops + horizon), MoreArgs = list(y = y, yvals = yvals), SIMPLIFY = FALSE)
  names(train) <- paste("Training", gsub(" ", "0", format(seq_along(train))), sep = "")
  names(test) <- paste("Testing", gsub(" ", "0", format(seq_along(test))), sep = "")

  # reduce the number of slices returned if skip > 0
  if (skip > 0) {
    thin <- function(x, skip = 2) {
      n <- length(x)
      x[seq(1, n, by = skip)]
    }
    train <- thin(train, skip = skip + 1)
    test <- thin(test, skip = skip + 1)
  }

  # eliminate any slices that have no observations in either the training set or the validation set
  empty <- which(lengths(train) == 0 | lengths(test) == 0)
  if (length(empty) > 0) {
    train <- train[-empty]
    test <- test[-empty]
  }

  out <- list(train = train, test = test)
  out
}
Some features to note:
- It doesn’t matter what order y is in when passed to the function.
- It doesn’t matter if there are unrepresented time periods in y. The function groups data by unit, using all units in range(y), whether or not there is an observation within each unit.
- If units without any observations result in a partition with an empty training set or an empty validation set, that partition is not returned.
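The grouping idea behind these notes can be sketched on its own: build a regular grid of units over range(y), then count the observations in each interval. The timestamps below are hypothetical toy values, and the names grid and counts are assumptions for this example only.

```r
library(lubridate)

# Hypothetical irregular observation times, deliberately unsorted,
# with no observation at all in February 2020
y <- as.POSIXct(c("2020-03-20", "2020-01-05", "2020-01-25", "2020-04-02"), tz = "UTC")

# Regular monthly grid covering range(y), including the empty month
grid <- seq(from = floor_date(min(y), "month"),
            to = ceiling_date(max(y), "month"),
            by = "month")

# Count the observations falling in each interval (grid[i], grid[i + 1]]
counts <- vapply(seq_len(length(grid) - 1),
                 function(i) sum(y > grid[i] & y <= grid[i + 1]),
                 integer(1))
names(counts) <- format(grid[-length(grid)], "%Y-%m")
counts
```

February appears in the grid with a count of zero, which is why a slice whose training or validation window falls entirely within such a gap comes back empty and is dropped.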
Example
For starters, we need a data set with a date/time variable. Let’s use the economics data included in ggplot2.
library(ggplot2)
data(economics)
Next, let’s use createIrregularTimeSlices() to create data partitions. I’ll use a fixed window of 20 quarters for training data, to be validated on the following 4 quarters. There are 170 possible 20/4 quarter training/validation sets in the data. To reduce the number of training/validation combinations, I set the skip argument to 4, which keeps only every fifth resample and so reduces the training time.
my_partitions <- createIrregularTimeSlices(economics$date, initialWindow = 20, horizon = 4, unit = "quarter", fixedWindow = TRUE, skip = 4)
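To make the effect of skip concrete, here is the same thinning arithmetic the function uses (a step of skip + 1), applied to the 170 candidate slices mentioned above. The variable names n_slices and kept are made up for this sketch.

```r
n_slices <- 170  # number of candidate slices quoted in the text
skip <- 4

# Positions of the slices that survive thinning: 1, then every (skip + 1)th
kept <- seq(1, n_slices, by = skip + 1)

length(kept)  # number of resamples retained
head(kept)    # positions of the first few retained slices
```

This keeps 34 slices, at positions 1, 6, 11, and so on up to 166.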
Finally, let’s use the partitions to train a model.
library(caret)
library(mgcv)
library(nlme)
ctrl <- trainControl(index = my_partitions$train, indexOut = my_partitions$test)
mod <- train(psavert ~ pce + pop + uempmed + unemploy, data = economics, method = 'gam', trControl = ctrl)
Note that caret calculates the average performance across resamples. createIrregularTimeSlices() can produce resamples with varying sample sizes in the validation set, so you may want to take a weighted average of the calculated performance values, weighted by the sample size in the validation set.
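A minimal sketch of such a weighted average is below. The numbers are hypothetical toy values, not results from the model above. The data frame is only assumed to resemble the per-resample table that caret stores in mod$resample (a metric column plus a Resample column), and validation_n stands in for the validation set sizes.

```r
# Hypothetical per-resample performance (toy values, not real results)
resample_perf <- data.frame(
  RMSE = c(1.0, 2.0, 4.0),
  Resample = c("Training001", "Training006", "Training011"),
  stringsAsFactors = FALSE
)

# Hypothetical validation set sizes, named by resample
validation_n <- c(Training001 = 12, Training006 = 6, Training011 = 2)

# Align the weights to the rows of the performance table by name
w <- validation_n[resample_perf$Resample]

unweighted_rmse <- mean(resample_perf$RMSE)
weighted_rmse <- weighted.mean(resample_perf$RMSE, w)

c(unweighted = unweighted_rmse, weighted = weighted_rmse)
```

In this toy case the weighted value is (12 * 1 + 6 * 2 + 2 * 4) / 20 = 1.6, lower than the unweighted mean because the largest validation set also has the smallest error.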
The indices created with createIrregularTimeSlices() are stored within the caret model object, so you can inspect them later to retrieve the training/validation sample sizes.
training_sample_size <- sapply(mod$control$index, length)
validation_sample_size <- sapply(mod$control$indexOut, length)
cbind(training_sample_size, validation_sample_size)
##             training_sample_size validation_sample_size
## Training001                   55                     12
## Training006                   57                     12
## Training011                   57                     12
## Training016                   57                     12
## Training021                   57                     12
## Training026                   57                     12
## Training031                   57                     12
## Training036                   57                     12
## Training041                   57                     12
## Training046                   57                     12
## Training051                   57                     12
## Training056                   57                     12
## Training061                   57                     12
## Training066                   57                     12
## Training071                   57                     12
## Training076                   57                     12
## Training081                   57                     12
## Training086                   57                     12
## Training091                   57                     12
## Training096                   57                     12
## Training101                   57                     12
## Training106                   57                     12
## Training111                   57                     12
## Training116                   57                     12
## Training121                   57                     12
## Training126                   57                     12
## Training131                   57                     12
## Training136                   57                     12
## Training141                   57                     12
## Training146                   57                     12
## Training151                   57                     12
## Training156                   57                     12
## Training161                   57                     12
## Training166                   57                     12