This post is a big jump from the last two - it is not for beginners! But if you've ever considered building a GUI in R, looked at some of the online documentation, gotten scared, and decided not to, read this. OK, here goes.

Dorian Auto GUI

Setup: I built this for a school project. The basic problem setup is from a class I'm taking on operations research using spreadsheets. There is a car company named Dorian Auto that can produce 5 different car models - Small Car, Medium Car, Large Car, Medium Van, and Large Van. Each of these models requires a fixed amount of steel and a fixed amount of labor to produce. Each model also has a fixed profit associated with it - the payoff for producing one unit of that model. Here's the tricky part: there is a minimum production quantity for each model as well. For example, if Dorian is going to produce Small Cars, it must produce at least 1000 of them. Here is this information summarized in a data frame:
###### Set Default Values
Dorian <- data.frame(Model = c("Small Car", "Medium Car", "Large Car", "Medium Van", 
    "Large Van"), SteelReq = c(1.5, 3, 5, 6, 8), LabReq = c(30, 25, 40, 45, 
    55), MinProd = c(1000, 1000, 1000, 200, 200), Profit = c(2000, 2500, 3000, 
    5500, 7000))
# Change Model to class character (instead of factor) so that model names
# can be changed later. If I don't do this and I try to change one of the
# model names, I will get an error that the new name is not one of the
# current levels of the variable.
Dorian$Model <- as.character(Dorian$Model)
Dorian
##        Model SteelReq LabReq MinProd Profit
## 1  Small Car      1.5     30    1000   2000
## 2 Medium Car      3.0     25    1000   2500
## 3  Large Car      5.0     40    1000   3000
## 4 Medium Van      6.0     45     200   5500
## 5  Large Van      8.0     55     200   7000
Dorian also has a fixed amount of resources available:
Materials <- data.frame(Steel = 6500, Labor = 65000)
Materials
##   Steel Labor
## 1  6500 65000
Now my job is to maximize Dorian Auto's profit by choosing the most profitable production schedule given the resource constraints and the minimum production quantities. I used the Rglpk package to do this. This is essentially a mixed-integer optimization problem with several binary (yes/no) variables that decide whether each model gets produced at all. Here's the basic function I used to solve for the optimal solution:
DorianAutoFunction<-function(Inputs,Constraints){
  library(Rglpk)
  # Introduce my data
  Dorian <- data.frame(Inputs)
  Dorian<-Dorian[complete.cases(Dorian),]
  num.models <- nrow(Dorian)
  Materials<-data.frame(Constraints)

  # only x1, x2, x3, x4, x5 contribute to the total profit
  objective  <- c(Dorian$Profit, rep(0, num.models))

  constraints.mat <- rbind(
    c(Dorian$SteelReq, rep(0, num.models)),                    # total steel used
    c(Dorian$LabReq,   rep(0, num.models)),                    # total labor used
    cbind(-diag(num.models), +diag(Dorian$MinProd)),           # MinProd_i * z_i - x_i <= 0 : if z_i = 1, make at least MinProd_i
    cbind(+diag(num.models), -diag(rep(9999999, num.models)))) # x_i - 9999999 * z_i <= 0 : if z_i = 0, make none

  constraints.dir <- c("<=",
                       "<=",
                       rep("<=", num.models),
                       rep("<=", num.models))

  constraints.rhs <- c(Materials$Steel,
                       Materials$Labor,
                       rep(0, num.models),
                       rep(0, num.models))

  var.types <- c(rep("I", num.models),  # x1, x2, x3, x4, x5 are integers
                 rep("B", num.models))  # z1, z2, z3, z4, z5 are booleans

  mysolution<-Rglpk_solve_LP(obj   = objective,
                             mat   = constraints.mat,
                             dir   = constraints.dir,
                             rhs   = constraints.rhs,
                             types = var.types,
                             max   = TRUE)
  return(mysolution)
}
I must give credit here to flodel on Stack Overflow for this function.
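For readers who find the constraint matrix hard to parse, this is the model it encodes (my own notation, not from the class materials): x_i is the number of units of model i produced, z_i is a binary "do we produce model i at all?" variable, and M = 9999999 is a big-M constant.

\[
\begin{aligned}
\max_{x,\, z}\quad & \sum_{i} \text{Profit}_i\, x_i \\
\text{s.t.}\quad & \sum_{i} \text{SteelReq}_i\, x_i \le \text{Steel}, \qquad \sum_{i} \text{LabReq}_i\, x_i \le \text{Labor}, \\
& \text{MinProd}_i\, z_i - x_i \le 0 \quad \text{(if } z_i = 1\text{, produce at least the minimum)}, \\
& x_i - M z_i \le 0 \quad \text{(if } z_i = 0\text{, produce nothing)}, \\
& x_i \in \mathbb{Z}_{\ge 0}, \quad z_i \in \{0, 1\}.
\end{aligned}
\]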
This class and textbook rely entirely on Excel, and I must point out that while solving this in Excel might be easier, Excel actually returns a wrong answer. If you call the function defined above on the Dorian data frame previously defined, you get a solution:
DorianAutoFunction(Dorian, Materials)
## $optimum
## [1] 6408000
## 
## $solution
##  [1] 1000    0    0  202  471    1    0    0    1    1
## 
## $status
## [1] 0
Dorian Auto should produce 1000 Small Cars, 0 Medium Cars, 0 Large Cars, 202 Medium Vans, and 471 Large Vans for a profit of $6,408,000.
This solution is not necessarily intuitive. Large Vans are the most profitable per unit of resources required, so why don't we make more of these? The only reason we produce any Small Cars or Medium Vans in the solution is that they most efficiently eat up the last bits of resources that would be left over if we were only producing Large Vans. So the solution we arrive at is something like this:
1. Produce the bare minimum number of Small Cars because these use the smallest amount of labor and steel, so they can eat up the last bits of remaining materials at the end.
2. Is there any intermediate vehicle we could produce that would eat up some of the leftover after we're done producing the Large Vans, while earning us more profit per unit of materials than Small Cars? If so, produce the bare minimum of these. In this particular case, this is the Medium Van.
3. After this, produce as many Large Vans as possible to maximize profit. Large Vans earn us the most per unit of materials required.
4. Now go back and use up the last bits of resources that are too few to produce one more Large Van. First try to use the leftovers on a Medium Van (slightly better payoff than Small Cars). If there isn't enough to produce a Medium Van, use the leftovers on a Small Car.
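As a quick sanity check on that story (these few lines are mine, not part of the original presentation), the optimal schedule really does use up nearly everything:
# How much of each resource does the optimal schedule leave unused?
schedule <- c(1000, 0, 0, 202, 471)
Materials$Steel - sum(schedule * Dorian$SteelReq)  # 20 tons of steel left over
Materials$Labor - sum(schedule * Dorian$LabReq)    # 5 hours of labor left over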
The solution in the textbook is slightly different from the one we arrived at using Rglpk. It says Dorian should produce 1000 Small Cars, 0 Medium Cars, 0 Large Cars, 200 Medium Vans, and 473 Large Vans for a profit of sum(c(1000, 0, 0, 200, 473) * Dorian$Profit) = $6,411,000. At first I thought my function was slightly off - I was getting less profit than the "correct" answer. But upon closer inspection, the textbook's (and Excel's) solution is over budget on labor:
sum(c(1000, 0, 0, 200, 473) * Dorian$LabReq) <= Materials$Labor
## [1] FALSE
So the solution presented in the textbook is hardly a solution at all. The whole point of this exercise is to intelligently allocate the resources currently available. If we were going to ignore these constraints and simply define a strategy to maximize profit, regardless of what was currently available, we would choose to produce exclusively Large Vans because they net the most profit per unit of input materials. The reason Excel is wrong here is a rounding error, so it's only slightly off, but it's still wrong.
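If you're curious just how far over budget the textbook schedule is (again, my own check, not from the textbook):
textbook <- c(1000, 0, 0, 200, 473)
sum(textbook * Dorian$LabReq) - Materials$Labor     # 15 labor hours over budget
sum(textbook * Dorian$SteelReq) <= Materials$Steel  # the steel constraint is still satisfied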
Enough about linear programming, though. On to the good stuff. For my presentation, I wanted something attractive to summarize the solution, so I wrote another little function. This function produces 3 figures and I wanted to display them all in one graphics device in the GUI, so I borrowed some code from the Cookbook for R website (which, by the way, I use all the time and it's awesome). Here's the function for arranging multiple plots together into one graphics window:
multiplot <- function(..., plotlist = NULL, file, cols = 1, layout = NULL) {
    library(grid)

    # Make a list from the ... arguments and plotlist
    plots <- c(list(...), plotlist)

    numPlots = length(plots)

    # If layout is NULL, then use 'cols' to determine layout
    if (is.null(layout)) {
        # Make the panel ncol: Number of columns of plots nrow: Number of rows
        # needed, calculated from # of cols
        layout <- matrix(seq(1, cols * ceiling(numPlots/cols)), ncol = cols, 
            nrow = ceiling(numPlots/cols))
    }

    if (numPlots == 1) {
        print(plots[[1]])

    } else {
        # Set up the page
        grid.newpage()
        pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

        # Make each plot, in the correct location
        for (i in 1:numPlots) {
            # Get the i,j matrix positions of the regions that contain this subplot
            matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

            print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row, layout.pos.col = matchidx$col))
        }
    }
}
And here's the function for displaying the solution to the problem in a way that's nicer to look at than a spreadsheet:
CalcFunction <- function(Dorian, Materials) {
    mysolution <- DorianAutoFunction(Dorian, Materials)
    num.models <- nrow(Dorian)

    graphdat <- data.frame(Model = Dorian$Model, Production = mysolution$solution[1:num.models], 
        ProfitContributed = mysolution$solution[1:num.models] * Dorian$Profit)

    # Fraction of each resource consumed by each model (steel rows first, then labor rows)
    graphdat2 <- data.frame(Model = rep(Dorian$Model, 2), Materials = rep(c("Steel", 
        "Labor"), each = num.models), percentMaterials = c(mysolution$solution[1:num.models] * 
        Dorian$SteelReq/Materials$Steel, mysolution$solution[1:num.models] * 
        Dorian$LabReq/Materials$Labor))

    library(ggplot2)
    p <- ggplot(graphdat, aes(x = Model, y = Production, fill = Model)) + geom_bar(stat = "identity") + 
        guides(fill = F) + geom_text(label = as.character(graphdat$Production), 
        y = 25) + labs(title = "Production Schedule", x = "")

    q <- ggplot(graphdat2, aes(x = Materials, y = percentMaterials, fill = Model)) + 
        geom_bar(stat = "identity", position = "stack") + labs(y = "% Used", title = "Resource Consumption", 
        x = "")

    r <- ggplot(graphdat, aes(x = Model, y = ProfitContributed/1e+06, fill = Model)) + 
        geom_bar(stat = "identity") + guides(fill = F) + labs(title = paste("Total Profit = $", 
        as.character(sum(graphdat$ProfitContributed)), sep = ""), y = "Profit ($ Millions)", 
        x = "") + theme(plot.title = element_text(face = "bold", size = 20))

    # The layout matrix puts plot 1 (r) across the top row and plots 2 and 3 (p, q) on the bottom row
    multiplot(r, p, q, layout = matrix(c(1, 2, 1, 3), nrow = 2))
}
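You can also call this directly, outside the GUI, to draw the dashboard in whatever graphics device is currently active:
CalcFunction(Dorian, Materials)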

Sensitivity Analysis

I was not the first one to present one of these problems to my class, and I had noticed in other presentations that something called sensitivity analysis was popular among the students in my class. Since my presentation would be at least partially peer graded, I figured I had best include some sensitivity analysis. Sensitivity analysis is essentially just asking "What would happen to my solution if I tweaked the inputs just a little?" For example, if I lowered the profit associated with the sale of one Large Van just a little (say $25), my solution wouldn't change, but if I kept lowering it, at some point I would cross a threshold and it would suddenly become more profitable to adopt an entirely different strategy. After doing some research online, I realized that you can actually get quite fancy with sensitivity analysis in R using all kinds of algorithms, but I decided to take a much less refined approach. I simply took the constraint I wanted to investigate, changed it a little, solved the problem again with the new value, and repeated. Then I put all of the solutions in a data frame and plotted them: profit as a function of the changing parameter, and the number of each car model produced in the optimal solution as a function of the changing parameter. I made separate sensitivity analysis functions for steel, labor, minimum production quantity, profit/unit, and for each of the materials constraints:
##### Steel Sensitivity Analysis #####
SteelSensitivity <- function(x) {
    library(reshape2)
    library(ggplot2)
    SteelList <- replicate(n = 26 + min(25, (Dorian[x, 2]/0.1) - 1), Dorian, 
        simplify = F)
    SteelList[[1]][x, 2] <- max(Dorian[x, 2] - 2.5, 0.5)
    for (i in 2:length(SteelList)) {
        SteelList[[i]][x, 2] <- SteelList[[i - 1]][x, 2] + 0.1
    }
    SteelSens <- sapply(SteelList, DorianAutoFunction, Constraints = Materials)
    SensDat <- data.frame(t(rbind(sapply(SteelSens[2, ], unlist)[1:5, ], unlist(SteelSens[1, 
        ]))))
    names(SensDat) <- c(Dorian[, 1], "Profit")
    SensDat$Steel <- seq(max(Dorian[x, 2] - 2.5, 0.5), (length(SteelList) - 
        1) * 0.1 + max(Dorian[x, 2] - 2.5, 0.5), by = 0.1)
    SensDat.melt <- melt(data = SensDat, id.vars = c("Steel", "Profit"), measure.vars = c("Small Car", 
        "Medium Car", "Large Car", "Medium Van", "Large Van"))
    prod.plot <- ggplot(SensDat.melt, aes(x = Steel, y = value, color = variable)) + 
        geom_line(lwd = 1.2) + geom_vline(xintercept = Dorian[x, 2], color = "blue", 
        lwd = 2, alpha = 0.2) + labs(x = "", y = "Production Schedule", color = "Model", 
        title = paste("Sensitivity Analysis of", Dorian[x, 1], "Steel Requirement")) + 
        theme(legend.position = "bottom")

    prof.plot <- ggplot(SensDat, aes(x = Steel, y = Profit/1e+05)) + geom_line(color = "red", 
        lwd = 2) + geom_vline(xintercept = Dorian[x, 2], color = "blue", lwd = 2, 
        alpha = 0.2) + labs(x = "Steel Required", y = "Profit ($100,000's)") + 
        theme(legend.position = "bottom")

    multiplot(prod.plot, prof.plot)
}

##### Labor Sensitivity Analysis #####
LaborSensitivity <- function(x) {
    library(reshape2)
    library(ggplot2)
    LaborList <- replicate(n = 26 + min(25, (Dorian[x, 3]) - 1), Dorian, simplify = F)
    LaborList[[1]][x, 3] <- max(Dorian[x, 3] - 25, 1)
    for (i in 2:length(LaborList)) {
        LaborList[[i]][x, 3] <- LaborList[[i - 1]][x, 3] + 1
    }
    LaborSens <- sapply(LaborList, DorianAutoFunction, Constraints = Materials)
    SensDat <- data.frame(t(rbind(sapply(LaborSens[2, ], unlist)[1:5, ], unlist(LaborSens[1, 
        ]))))
    names(SensDat) <- c(Dorian[, 1], "Profit")
    SensDat$Labor <- seq(max(Dorian[x, 3] - 25, 1), (length(LaborList) - 1) + 
        max(Dorian[x, 3] - 25, 1), by = 1)
    SensDat.melt <- melt(data = SensDat, id.vars = c("Labor", "Profit"), measure.vars = c("Small Car", 
        "Medium Car", "Large Car", "Medium Van", "Large Van"))
    prod.plot <- ggplot(SensDat.melt, aes(x = Labor, y = value, color = variable)) + 
        geom_line(lwd = 1.2) + geom_vline(xintercept = Dorian[x, 3], color = "blue", 
        lwd = 2, alpha = 0.2) + labs(x = "", y = "Production Schedule", color = "Model", 
        title = paste("Sensitivity Analysis of", Dorian[x, 1], "Labor Requirement")) + 
        theme(legend.position = "bottom")

    prof.plot <- ggplot(SensDat, aes(x = Labor, y = Profit/1e+05)) + geom_line(color = "red", 
        lwd = 2) + geom_vline(xintercept = Dorian[x, 3], color = "blue", lwd = 2, 
        alpha = 0.2) + labs(x = "Labor Required", y = "Profit ($100,000's)") + 
        theme(legend.position = "bottom")

    multiplot(prod.plot, prof.plot)
}

##### Minimum Production Sensitivity Analysis #####
MinProductionSensitivity <- function(x) {
    library(reshape2)
    library(ggplot2)
    MinProductionList <- replicate(n = 26 + min(25, (Dorian[x, 4]/10) - 1), 
        Dorian, simplify = F)
    MinProductionList[[1]][x, 4] <- max(Dorian[x, 4] - 250, 10)
    for (i in 2:length(MinProductionList)) {
        MinProductionList[[i]][x, 4] <- MinProductionList[[i - 1]][x, 4] + 10
    }
    MinProductionSens <- sapply(MinProductionList, DorianAutoFunction, Constraints = Materials)
    SensDat <- data.frame(t(rbind(sapply(MinProductionSens[2, ], unlist)[1:5, 
        ], unlist(MinProductionSens[1, ]))))
    names(SensDat) <- c(Dorian[, 1], "Profit")
    SensDat$MinProduction <- seq(max(Dorian[x, 4] - 250, 10), (length(MinProductionList) - 
        1) * 10 + max(Dorian[x, 4] - 250, 10), by = 10)
    SensDat.melt <- melt(data = SensDat, id.vars = c("MinProduction", "Profit"), 
        measure.vars = c("Small Car", "Medium Car", "Large Car", "Medium Van", 
            "Large Van"))
    prod.plot <- ggplot(SensDat.melt, aes(x = MinProduction, y = value, color = variable)) + 
        geom_line(lwd = 1.2) + geom_vline(xintercept = Dorian[x, 4], color = "blue", 
        lwd = 2, alpha = 0.2) + labs(x = "", y = "Production Schedule", color = "Model", 
        title = paste("Sensitivity Analysis of", Dorian[x, 1], "Minimum Production Requirement")) + 
        theme(legend.position = "bottom")

    prof.plot <- ggplot(SensDat, aes(x = MinProduction, y = Profit/1e+05)) + 
        geom_line(color = "red", lwd = 2) + geom_vline(xintercept = Dorian[x, 
        4], color = "blue", lwd = 2, alpha = 0.2) + labs(x = "Minimum Production Requirement", 
        y = "Profit ($100,000's)") + theme(legend.position = "bottom")

    multiplot(prod.plot, prof.plot)
}

##### Profit Sensitivity Analysis #####
ModProfitSensitivity <- function(x) {
    library(reshape2)
    library(ggplot2)
    ModProfitList <- replicate(n = 26 + min(25, (Dorian[x, 5]/25) - 25), Dorian, 
        simplify = F)
    ModProfitList[[1]][x, 5] <- max(Dorian[x, 5] - 625, 25)
    for (i in 2:length(ModProfitList)) {
        ModProfitList[[i]][x, 5] <- ModProfitList[[i - 1]][x, 5] + 25
    }
    ModProfitSens <- sapply(ModProfitList, DorianAutoFunction, Constraints = Materials)
    SensDat <- data.frame(t(rbind(sapply(ModProfitSens[2, ], unlist)[1:5, ], 
        unlist(ModProfitSens[1, ]))))
    names(SensDat) <- c(Dorian[, 1], "Profit")
    SensDat$ModProfit <- seq(max(Dorian[x, 5] - 625, 25), (length(ModProfitList) - 
        1) * 25 + max(Dorian[x, 5] - 625, 25), by = 25)
    SensDat.melt <- melt(data = SensDat, id.vars = c("ModProfit", "Profit"), 
        measure.vars = c("Small Car", "Medium Car", "Large Car", "Medium Van", 
            "Large Van"))
    prod.plot <- ggplot(SensDat.melt, aes(x = ModProfit, y = value, color = variable)) + 
        geom_line(lwd = 1.2) + geom_vline(xintercept = Dorian[x, 5], color = "blue", 
        lwd = 2, alpha = 0.2) + labs(x = "", y = "Production Schedule", color = "Model", 
        title = paste("Sensitivity Analysis of", Dorian[x, 1], "Profit per Unit Requirement")) + 
        theme(legend.position = "bottom")

    prof.plot <- ggplot(SensDat, aes(x = ModProfit, y = Profit/1e+05)) + geom_line(color = "red", 
        lwd = 2) + geom_vline(xintercept = Dorian[x, 5], color = "blue", lwd = 2, 
        alpha = 0.2) + labs(x = "Profit/Unit Sold", y = "Profit ($100,000's)") + 
        theme(legend.position = "bottom")

    multiplot(prod.plot, prof.plot)
}

##### Steel Available Sensitivity Analysis #####
SteelAvailSensitivity <- function(x) {
    library(reshape2)
    library(ggplot2)
    SteelAvailList <- replicate(n = 26 + min(25, (Materials[1, 1]/25) - 25), 
        Materials, simplify = F)
    SteelAvailList[[1]][1, 1] <- max(Materials[1, 1] - 625, 25)
    for (i in 2:length(SteelAvailList)) {
        SteelAvailList[[i]][1, 1] <- SteelAvailList[[i - 1]][1, 1] + 25
    }
    SteelAvailSens <- sapply(SteelAvailList, DorianAutoFunction, Inputs = Dorian)
    SensDat <- data.frame(t(rbind(sapply(SteelAvailSens[2, ], unlist)[1:5, ], 
        unlist(SteelAvailSens[1, ]))))
    names(SensDat) <- c(Dorian[, 1], "Profit")
    SensDat$SteelAvail <- seq(max(Materials[1, 1] - 625, 25), (length(SteelAvailList) - 
        1) * 25 + max(Materials[1, 1] - 625, 25), by = 25)
    SensDat.melt <- melt(data = SensDat, id.vars = c("SteelAvail", "Profit"), 
        measure.vars = c("Small Car", "Medium Car", "Large Car", "Medium Van", 
            "Large Van"))
    prod.plot <- ggplot(SensDat.melt, aes(x = SteelAvail, y = value, color = variable)) + 
        geom_line(lwd = 1.2) + geom_vline(xintercept = Materials[1, 1], color = "blue", 
        lwd = 2, alpha = 0.2) + labs(x = "", y = "Production Schedule", color = "Model", 
        title = "Sensitivity Analysis of Steel Available") + theme(legend.position = "bottom")

    prof.plot <- ggplot(SensDat, aes(x = SteelAvail, y = Profit/1e+05)) + geom_line(color = "red", 
        lwd = 2) + geom_vline(xintercept = Materials[1, 1], color = "blue", 
        lwd = 2, alpha = 0.2) + labs(x = "Steel Available", y = "Profit ($100,000's)") + 
        theme(legend.position = "bottom")

    multiplot(prod.plot, prof.plot)
}

##### Labor Available Sensitivity Analysis #####
LabAvailSensitivity <- function(x) {
    library(reshape2)
    library(ggplot2)
    LaborAvailList <- replicate(n = 26 + min(25, (Materials[1, 2]/250) - 25), 
        Materials, simplify = F)
    LaborAvailList[[1]][1, 2] <- max(Materials[1, 2] - 6250, 250)
    for (i in 2:length(LaborAvailList)) {
        LaborAvailList[[i]][1, 2] <- LaborAvailList[[i - 1]][1, 2] + 250
    }
    LaborAvailSens <- sapply(LaborAvailList, DorianAutoFunction, Inputs = Dorian)
    SensDat <- data.frame(t(rbind(sapply(LaborAvailSens[2, ], unlist)[1:5, ], 
        unlist(LaborAvailSens[1, ]))))
    names(SensDat) <- c(Dorian[, 1], "Profit")
    SensDat$LaborAvail <- seq(max(Materials[1, 2] - 6250, 250), (length(LaborAvailList) - 
        1) * 250 + max(Materials[1, 2] - 6250, 250), by = 250)
    SensDat.melt <- melt(data = SensDat, id.vars = c("LaborAvail", "Profit"), 
        measure.vars = c("Small Car", "Medium Car", "Large Car", "Medium Van", 
            "Large Van"))
    prod.plot <- ggplot(SensDat.melt, aes(x = LaborAvail, y = value, color = variable)) + 
        geom_line(lwd = 1.2) + geom_vline(xintercept = Materials[1, 2], color = "blue", 
        lwd = 2, alpha = 0.2) + labs(x = "", y = "Production Schedule", color = "Model", 
        title = "Sensitivity Analysis of Labor Available") + theme(legend.position = "bottom")

    prof.plot <- ggplot(SensDat, aes(x = LaborAvail, y = Profit/1e+05)) + geom_line(color = "red", 
        lwd = 2) + geom_vline(xintercept = Materials[1, 2], color = "blue", 
        lwd = 2, alpha = 0.2) + labs(x = "Labor Available", y = "Profit ($100,000's)") + 
        theme(legend.position = "bottom")

    multiplot(prod.plot, prof.plot)
}
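As an aside, the six functions above are nearly copy-paste clones of one another. If you wanted to trim the repetition, a sketch like the following (GenericSensitivity, row, col, delta and n.steps are my own names, not code from the presentation) would build the same grid of re-solved problems for any single cell of the Dorian data frame; the melting and plotting would then proceed exactly as above.
GenericSensitivity <- function(row, col, delta, n.steps = 25) {
    # Grid of values: up to n.steps increments below the current value
    # (never dropping to zero or below) and n.steps increments above it
    current <- Dorian[row, col]
    vals <- seq(max(current - n.steps * delta, delta), current + n.steps * delta, by = delta)
    out <- lapply(vals, function(v) {
        D <- Dorian
        D[row, col] <- v                         # tweak one cell
        sol <- DorianAutoFunction(D, Materials)  # re-solve
        c(Value = v, Profit = sol$optimum, setNames(sol$solution[1:nrow(D)], D$Model))
    })
    data.frame(do.call(rbind, out), check.names = FALSE)
}
# e.g. the data behind SteelSensitivity(1):
# GenericSensitivity(row = 1, col = 2, delta = 0.1)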
Note that each of the six sensitivity functions requires an input variable x. This corresponds to the row number of the Dorian data frame that we are interested in exploring; the column we are interested in is determined by the function we call. Let's try using one of these to see how it works:
SteelSensitivity(1)
[Plot: sensitivity analysis of the Small Car steel requirement - production schedule (top) and profit (bottom)]
This breaks down what would happen if we manipulated the 1st value in the Steel Required column of the Dorian data frame (the steel required to produce a Small Car). The first thing to notice is that in the bottom figure, profit decreases from left to right. This makes sense: as the required materials increase, profit should go down. Notice that eventually profit stops decreasing. This is the point at which it is no longer most profitable to produce Small Cars, after which it doesn't matter what we do to the steel requirement because we're no longer producing any Small Cars. In the top plot you can see how the optimal production schedule changes in response to the steel required to produce a Small Car. The pale blue line in the background marks the value currently set in the Dorian data frame. This function does not just produce the same figures each time it's called; it recalculates each time and produces a plot that goes 25 increments above the set value and 25 increments below (or down to one increment above zero, whichever comes first). I chose the increments to be proportional to the values given in the problem, so a steel increment is 0.1 tons of steel, a labor increment is 1 hour of labor, a minimum production requirement increment is 10 units, a profit increment is $25, an increment for steel available is 25 tons, and an increment for labor available is 250 hours. Now for the really fun part: let's put this all into an easy-to-use GUI so that anyone can play with it and explore the problem.

The GUI Portion

One of my professors recommended a book on GUI building in R, Programming Graphical User Interfaces in R by Lawrence and Verzani, and it proved to be extremely helpful.
This book covers using 3 packages to build GUIs, and I stuck with the simplest one: gWidgets. This package is designed around ease of use, and because I don't have a strong background in programming I thought it would be for the best if I started out simple. gWidgets can sit on top of two GUI "toolkits", GTK and Tcl/Tk. I stuck with GTK for my GUI because I had heard of it before and had done some reading about it.
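If you don't already have the packages, they can be installed the usual way (both were on CRAN when this was written; note that RGtk2 also needs the GTK+ system libraries installed on your machine):
install.packages(c("gWidgets", "RGtk2"))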
First I load the package and specify which of the two GUI toolboxes I want to use.
library(gWidgets)
options(guiToolkit = "RGtk2")
Next I create a window in which my GUI will exist.
window <- gwindow("Dorian Auto", visible = T)
Then I create a tabbed notebook inside my window. I never ended up adding another tab, but it's nice to have the option.
notebook <- gnotebook(cont = window)
Then I add a “group”. This is just a partition of my notebook tab.
group1 <- ggroup(cont = notebook, label = "Input Model Constraints", horizontal = F)
Then I create a “glayout” in my “group”. A “glayout” is just a nice grid format for my GUI components that saves me the trouble of organizing them on the page. It fits all of the objects I put inside it into a grid and arranges them neatly. I can add and call elements of my “glayout” much like you would reference elements inside a matrix.
lyt <- glayout(cont = group1, horizontal = T)
Notice that I put my “glayout” inside my “group”. This leaves room in my notebook tab later for graphic output. Had I skipped the “group” container and just put my “glayout” inside my notebook tab, my graphic display would be confined to one element of my “glayout”, but I want the graphic display to be much bigger than anything else on the page. Next I add a series of labels to my “glayout”. You cannot interact with labels in a GUI.
lyt[1, 1] <- glabel(text = "Model Name")
lyt[1, 2] <- glabel(text = "Steel Required/Unit")
lyt[1, 4] <- glabel(text = "Labor Required/Unit")
lyt[1, 6] <- glabel(text = "Production Minimum")
lyt[1, 8] <- glabel(text = "Profit/Unit")
lyt[1, 10] <- glabel(text = "                 ")
lyt[1, 11] <- glabel(text = "Steel Available")
lyt[1, 13] <- glabel(text = "Labor Available")
Next I add a series of text boxes underneath the “Model Name” label.
# Model Names
lyt[2, 1] <- gedit(text = Dorian[1, 1], handler = function(h, ...) {
    Dorian[1, 1] <<- svalue(h$obj)
})
lyt[3, 1] <- gedit(text = Dorian[2, 1], handler = function(h, ...) {
    Dorian[2, 1] <<- svalue(h$obj)
})
lyt[4, 1] <- gedit(text = Dorian[3, 1], handler = function(h, ...) {
    Dorian[3, 1] <<- svalue(h$obj)
})
lyt[5, 1] <- gedit(text = Dorian[4, 1], handler = function(h, ...) {
    Dorian[4, 1] <<- svalue(h$obj)
})
lyt[6, 1] <- gedit(text = Dorian[5, 1], handler = function(h, ...) {
    Dorian[5, 1] <<- svalue(h$obj)
})
The text argument is what to display as the default value of the text box, and the handler argument is a function that will be called when the text box is interacted with in the GUI. My handler function essentially says "If the text in the text box is changed, change the corresponding value in the Dorian data frame". Next I do basically the same thing for the Dorian$SteelReq values, but since these values should be numeric instead of character, I use a spin button to get these values from the user instead of a text box. I can add from, to, and by arguments to control how the value inside the spin button changes when the user clicks the up or down arrow. Alternatively, the user can manually enter a number into the spin button just like they would enter text into a text box.
# Steel Required
lyt[2, 2] <- gspinbutton(from = 1, to = 15, by = 0.5, value = Dorian[1, 2], 
    digits = 0, handler = function(h, ...) {
        Dorian[1, 2] <<- svalue(h$obj)
    })
lyt[3, 2] <- gspinbutton(from = 1, to = 15, by = 0.5, value = Dorian[2, 2], 
    digits = 0, handler = function(h, ...) {
        Dorian[2, 2] <<- svalue(h$obj)
    })
lyt[4, 2] <- gspinbutton(from = 1, to = 15, by = 0.5, value = Dorian[3, 2], 
    digits = 0, handler = function(h, ...) {
        Dorian[3, 2] <<- svalue(h$obj)
    })
lyt[5, 2] <- gspinbutton(from = 1, to = 15, by = 0.5, value = Dorian[4, 2], 
    digits = 0, handler = function(h, ...) {
        Dorian[4, 2] <<- svalue(h$obj)
    })
lyt[6, 2] <- gspinbutton(from = 1, to = 15, by = 0.5, value = Dorian[5, 2], 
    digits = 0, handler = function(h, ...) {
        Dorian[5, 2] <<- svalue(h$obj)
    })
Next I'll add a button next to each spin button that will call up the sensitivity analysis of that particular constraint.
# Sensitivity analysis buttons for Steel
lyt[2, 3] <- SteelButton1 <- gbutton("?", handler = function(h, ...) {
    SteelSensitivity(1)
})
lyt[3, 3] <- SteelButton2 <- gbutton("?", handler = function(h, ...) {
    SteelSensitivity(2)
})
lyt[4, 3] <- SteelButton3 <- gbutton("?", handler = function(h, ...) {
    SteelSensitivity(3)
})
lyt[5, 3] <- SteelButton4 <- gbutton("?", handler = function(h, ...) {
    SteelSensitivity(4)
})
lyt[6, 3] <- SteelButton5 <- gbutton("?", handler = function(h, ...) {
    SteelSensitivity(5)
})
Next I add spin buttons and sensitivity analysis buttons for the other constraints.
# Labor Required
lyt[2, 4] <- gspinbutton(from = 10, to = 100, by = 5, value = Dorian[1, 3], 
    digits = 0, handler = function(h, ...) {
        Dorian[1, 3] <<- svalue(h$obj)
    })
lyt[3, 4] <- gspinbutton(from = 10, to = 100, by = 5, value = Dorian[2, 3], 
    digits = 0, handler = function(h, ...) {
        Dorian[2, 3] <<- svalue(h$obj)
    })
lyt[4, 4] <- gspinbutton(from = 10, to = 100, by = 5, value = Dorian[3, 3], 
    digits = 0, handler = function(h, ...) {
        Dorian[3, 3] <<- svalue(h$obj)
    })
lyt[5, 4] <- gspinbutton(from = 10, to = 100, by = 5, value = Dorian[4, 3], 
    digits = 0, handler = function(h, ...) {
        Dorian[4, 3] <<- svalue(h$obj)
    })
lyt[6, 4] <- gspinbutton(from = 10, to = 100, by = 5, value = Dorian[5, 3], 
    digits = 0, handler = function(h, ...) {
        Dorian[5, 3] <<- svalue(h$obj)
    })

# Sensitivity analysis buttons for Labor
lyt[2, 5] <- LaborButton1 <- gbutton("?", handler = function(h, ...) {
    LaborSensitivity(1)
})
lyt[3, 5] <- LaborButton2 <- gbutton("?", handler = function(h, ...) {
    LaborSensitivity(2)
})
lyt[4, 5] <- LaborButton3 <- gbutton("?", handler = function(h, ...) {
    LaborSensitivity(3)
})
lyt[5, 5] <- LaborButton4 <- gbutton("?", handler = function(h, ...) {
    LaborSensitivity(4)
})
lyt[6, 5] <- LaborButton5 <- gbutton("?", handler = function(h, ...) {
    LaborSensitivity(5)
})

# Minimum Production Quantity
lyt[2, 6] <- gspinbutton(from = 100, to = 2000, by = 50, value = Dorian[1, 4], 
    digits = 0, handler = function(h, ...) {
        Dorian[1, 4] <<- svalue(h$obj)
    })
lyt[3, 6] <- gspinbutton(from = 100, to = 2000, by = 50, value = Dorian[2, 4], 
    digits = 0, handler = function(h, ...) {
        Dorian[2, 4] <<- svalue(h$obj)
    })
lyt[4, 6] <- gspinbutton(from = 100, to = 2000, by = 50, value = Dorian[3, 4], 
    digits = 0, handler = function(h, ...) {
        Dorian[3, 4] <<- svalue(h$obj)
    })
lyt[5, 6] <- gspinbutton(from = 100, to = 2000, by = 50, value = Dorian[4, 4], 
    digits = 0, handler = function(h, ...) {
        Dorian[4, 4] <<- svalue(h$obj)
    })
lyt[6, 6] <- gspinbutton(from = 100, to = 2000, by = 50, value = Dorian[5, 4], 
    digits = 0, handler = function(h, ...) {
        Dorian[5, 4] <<- svalue(h$obj)
    })

# Sensitivity analysis buttons for MinProduction
lyt[2, 7] <- MinProductionButton1 <- gbutton("?", handler = function(h, ...) {
    MinProductionSensitivity(1)
})
lyt[3, 7] <- MinProductionButton2 <- gbutton("?", handler = function(h, ...) {
    MinProductionSensitivity(2)
})
lyt[4, 7] <- MinProductionButton3 <- gbutton("?", handler = function(h, ...) {
    MinProductionSensitivity(3)
})
lyt[5, 7] <- MinProductionButton4 <- gbutton("?", handler = function(h, ...) {
    MinProductionSensitivity(4)
})
lyt[6, 7] <- MinProductionButton5 <- gbutton("?", handler = function(h, ...) {
    MinProductionSensitivity(5)
})

# Profit per unit
lyt[2, 8] <- gspinbutton(from = 1000, to = 10000, by = 100, value = Dorian[1, 
    5], digits = 0, handler = function(h, ...) {
    Dorian[1, 5] <<- svalue(h$obj)
})
lyt[3, 8] <- gspinbutton(from = 1000, to = 10000, by = 100, value = Dorian[2, 
    5], digits = 0, handler = function(h, ...) {
    Dorian[2, 5] <<- svalue(h$obj)
})
lyt[4, 8] <- gspinbutton(from = 1000, to = 10000, by = 100, value = Dorian[3, 
    5], digits = 0, handler = function(h, ...) {
    Dorian[3, 5] <<- svalue(h$obj)
})
lyt[5, 8] <- gspinbutton(from = 1000, to = 10000, by = 100, value = Dorian[4, 
    5], digits = 0, handler = function(h, ...) {
    Dorian[4, 5] <<- svalue(h$obj)
})
lyt[6, 8] <- gspinbutton(from = 1000, to = 10000, by = 100, value = Dorian[5, 
    5], digits = 0, handler = function(h, ...) {
    Dorian[5, 5] <<- svalue(h$obj)
})

# Sensitivity analysis buttons for Profit/Unit
lyt[2, 9] <- ModProfitButton1 <- gbutton("?", handler = function(h, ...) {
    ModProfitSensitivity(1)
})
lyt[3, 9] <- ModProfitButton2 <- gbutton("?", handler = function(h, ...) {
    ModProfitSensitivity(2)
})
lyt[4, 9] <- ModProfitButton3 <- gbutton("?", handler = function(h, ...) {
    ModProfitSensitivity(3)
})
lyt[5, 9] <- ModProfitButton4 <- gbutton("?", handler = function(h, ...) {
    ModProfitSensitivity(4)
})
lyt[6, 9] <- ModProfitButton5 <- gbutton("?", handler = function(h, ...) {
    ModProfitSensitivity(5)
})

## Resource Input
lyt[2, 11] <- gspinbutton(from = 1000, to = 10000, by = 25, value = Materials[1, 
    1], digits = 0, handler = function(h, ...) {
    Materials[1, 1] <<- svalue(h$obj)
})
lyt[2, 13] <- gspinbutton(from = 10000, to = 1e+05, by = 250, value = Materials[1, 
    2], digits = 0, handler = function(h, ...) {
    Materials[1, 2] <<- svalue(h$obj)
})
## Sensitivity analysis buttons for the resource constraints (the x argument is
## ignored by these two functions, so the value passed doesn't matter)
lyt[2, 12] <- SteelAvailButton <- gbutton("?", handler = function(h, ...) {
    SteelAvailSensitivity(1)
})
lyt[2, 14] <- LaborAvailButton <- gbutton("?", handler = function(h, ...) {
    LabAvailSensitivity(1)
})
Finally I add a button at the bottom that solves the problem with the current version of the Dorian data frame.
# Optimize button
lyt[7, 1] <- calcbutton <- gbutton("Optimize")
addHandlerClicked(calcbutton, handler = function(h, ...) {
    CalcFunction(Dorian, Materials)
})
Can't forget to add the graphics display!
# Graphics Device
group3 <- ggroup(cont = group1, horizontal = F, label = "Optimal Production Schedule Dashboard")
graphicspane1 <- ggraphics(cont = group3, width = 1000, height = 450)
Done! Try running the code and using it. It's really satisfying clicking a button and watching a new display pop up!

UPDATE: Here are a couple screen shots of the GUI in use.
What you get when you click the "Optimize" button. A graphical presentation of the solution and your Total Profit.

A sensitivity analysis, achieved by clicking the "?" button next to the constraint you are interested in. It essentially answers the question "If I held all other things constant and changed this one parameter, what would the effect be?"

Update:
I mentioned towards the beginning of the post that this was a peer graded presentation. I got the results back and there was lots of good feedback. I "won" over the other two students presenting on the same problem, with 62% of students choosing me. Thanks guys! Here's a breakdown of the class's response:

Who would you recommend?

Answer                     Responses      %
1 First consultant                  7    21%
2 Second consultant (me)           21    62%
3 Third consultant                  6    18%
Total                              34   100%

Presentation Quality

Consultant                 Poor   Fair   Good   Impressive   Responses
1 First consultant            0      3     20           14          37
2 Second consultant (me)      1      0     13           23          37
3 Third consultant            0      1     24            9          34

Model and support

Consultant                 Poor   Fair   Good   Impressive   Responses
1 First consultant            0      2     22           13          37
2 Second consultant (me)      1      1      9           26          37
3 Third consultant            0      3     25            6          34


Cheers everyone!

Comments

  1. I guess the ones who rated you "Poor" on some of those were Excel jockeys. :-) Good stuff! Keep the posts coming!

    Purpose

    The caret package includes a function for data splitting, createTimeSlices(), that creates data partitions using a fixed or growing window. The main arguments to this function, initialWindow and horizon, allow the user to create training/validation resamples consisting of contiguous observations, with the validation set always consisting of n = horizon rows. If fixedWindow = TRUE, the training set always has n = initialWindow rows. This works well for regular time series, but what if your observations aren't recorded at regular intervals? How can you divide your data into training/validation sets that span fixed time intervals instead of a fixed number of rows?
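    As a tiny illustration of the stock behaviour (a toy example of my own, using 10 regularly spaced observations), each training slice below is 5 consecutive indices and each validation slice is the 2 indices that follow it:
    library(caret)
    str(createTimeSlices(1:10, initialWindow = 5, horizon = 2, fixedWindow = TRUE))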
    [Figure: fixedWindow and horizon illustrated]
    Allow me to present a solution:
    createIrregularTimeSlices <- function(y, initialWindow, horizon = 1, unit = c("sec", "min", "hour", "day", "week", "month", "year", "quarter"), fixedWindow = TRUE, skip = 0) {
      if(inherits(y, 'Date')) y <- as.POSIXct(y)
      stopifnot(inherits(y, 'POSIXt'))
      
      # generate the sequence of date/time values over which to split. These will always be in ascending order, with no missing date/times.
      yvals <- seq(from = lubridate::floor_date(min(y), unit), 
                   to = lubridate::ceiling_date(max(y), unit), 
                   by = unit)
      
      # determine the start and stop date/times for each time slice
      stops <- seq_along(yvals)[initialWindow:(length(yvals) - horizon)]
      if (fixedWindow) {
        starts <- stops - initialWindow + 1
      }else {
        starts <- rep(1, length(stops))
      }
      
      # function that returns the indices of y that are between the start and stop date/time for a slice 
      ind <- function(start, stop, y, yvals) {
        which(y > yvals[start] & y <= yvals[stop])
      }
      train <- mapply(ind, start = starts, stop = stops, MoreArgs = list(y = y, yvals = yvals), SIMPLIFY = FALSE)
      test <- mapply(ind, start = stops, stop = (stops + horizon), MoreArgs = list(y = y, yvals = yvals), SIMPLIFY = FALSE)
      names(train) <- paste("Training", gsub(" ", "0", format(seq(along = train))), sep = "")
      names(test) <- paste("Testing", gsub(" ", "0", format(seq(along = test))), sep = "")
      
      # reduce the number of slices returned if skip > 0
      if (skip > 0) {
        thin <- function(x, skip = 2) {
          n <- length(x)
          x[seq(1, n, by = skip)]
        }
        train <- thin(train, skip = skip + 1)
        test <- thin(test, skip = skip + 1)
      }
      
      # eliminate any slices that have no observations in either the training set or the validation set
      empty <- c(which(sapply(train, function(x) length(x) == 0)),
                 which(sapply(test, function(x) length(x) == 0)))
      if(length(empty) > 0){
        train <- train[-empty]
        test <- test[-empty]
      }
      
      out <- list(train = train, test = test)
      out
    }
    Some features to note:
    • It doesn’t matter what order y is in when passed to the function.
    • It doesn’t matter if there are unrepresented time periods in y. The function groups data by unit, using all units in range(y), whether or not there is an observation within each unit.
    • If units without any observations result in a partition with an empty training set or an empty validation set, that partition is not returned.

    Example

    For starters, we need a data set with a date/time variable. Let's use the economics data included in ggplot2.
    library(ggplot2)
    data(economics)
    Next, let's use createIrregularTimeSlices() to create data partitions. I'll use a fixed window of 20 quarters for training data, to be validated on the following 4 quarters. There are 170 possible 20/4 quarter training/validation sets in the data. To reduce the number of training/validation combinations, I use the skip argument to drop 4 resamples between each one kept (so only every fifth resample is used), reducing the number of resamples and thus reducing the training time.
    my_partitions <- createIrregularTimeSlices(economics$date, initialWindow = 20, horizon = 4, unit = "quarter", fixedWindow = T, skip = 4)
    Finally, let's use the partitions to train a model.
    library(caret)
    library(mgcv)
    library(nlme)
    ctrl <- trainControl(index = my_partitions$train, indexOut = my_partitions$test)
    mod <- train(psavert ~ pce + pop + uempmed + unemploy, data = economics, method = 'gam', trControl = ctrl)
    Note that caret calculates the average performance across resamples. createIrregularTimeSlices() can produce resamples with varying sample sizes in the validation set, so you may want to take a weighted average of the calculated performance values, weighted by the sample size in the validation set (a sketch of that follows the output below).
    The indices created with createIrregularTimeSlices() are stored within the caret model object, so you can inspect them later to retrieve the training/validation sample sizes.
    training_sample_size <- sapply(mod$control$index, length)
    validation_sample_size <- sapply(mod$control$indexOut, length)
    cbind(training_sample_size, validation_sample_size)
    ##             training_sample_size validation_sample_size
    ## Training001                   55                     12
    ## Training006                   57                     12
    ## Training011                   57                     12
    ## Training016                   57                     12
    ## Training021                   57                     12
    ## Training026                   57                     12
    ## Training031                   57                     12
    ## Training036                   57                     12
    ## Training041                   57                     12
    ## Training046                   57                     12
    ## Training051                   57                     12
    ## Training056                   57                     12
    ## Training061                   57                     12
    ## Training066                   57                     12
    ## Training071                   57                     12
    ## Training076                   57                     12
    ## Training081                   57                     12
    ## Training086                   57                     12
    ## Training091                   57                     12
    ## Training096                   57                     12
    ## Training101                   57                     12
    ## Training106                   57                     12
    ## Training111                   57                     12
    ## Training116                   57                     12
    ## Training121                   57                     12
    ## Training126                   57                     12
    ## Training131                   57                     12
    ## Training136                   57                     12
    ## Training141                   57                     12
    ## Training146                   57                     12
    ## Training151                   57                     12
    ## Training156                   57                     12
    ## Training161                   57                     12
    ## Training166                   57                     12
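    If you do want that weighted average, here is a rough sketch (mine, not part of caret; it assumes the resample labels in mod$resample match the names of the training index list, which they should when you supply a custom index):
    # Weight each resample's RMSE by the size of its validation set
    perf <- merge(mod$resample,
                  data.frame(Resample = names(mod$control$index),
                             n = validation_sample_size))
    weighted.mean(perf$RMSE, w = perf$n)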

    Introduction

    Rolling joins in data.table are incredibly useful, but not that well documented. I wrote this to help myself figure out how to use them and perhaps it can help you too.
    library(data.table)

    The Setup

    Imagine we have an eCommerce website that uses a third party (like PayPal) to handle payments. We track user sessions on our website and PayPal tracks our payments, and we would like to attribute sales to user sessions. This way we can answer all kinds of questions along the lines of “What types of sessions lead to sales?”
    Let's create some example data for a diverse group of customers.
    Indecisive Isabel shops a lot before she buys anything.
    isabel_website <- data.table(name = rep('Indecisive Isabel', 5),
                         session_start_time = as.POSIXct(c('2016-01-01 11:01', '2016-01-02 8:59', '2016-01-05 18:18', '2016-01-07 19:03', '2016-01-08 19:01')))
    isabel_paypal <- data.table(name = 'Indecisive Isabel', purchase_time = as.POSIXct('2016-01-08 19:10'))
    Spendy Sally visits the website once and makes multiple purchases.
    sally_website <- data.table(name = 'Spendy Sally', session_start_time = as.POSIXct('2016-01-03 10:00'))
    sally_paypal <- data.table(name = rep('Spendy Sally', 2), purchase_time = as.POSIXct(c('2016-01-03 10:06', '2016-01-03 10:15')))
    Frequent Francis comes to the site a lot and sometimes buys things.
    francis_website <- data.table(name = rep('Frequent Francis', 6),
                         session_start_time = as.POSIXct(c('2016-01-02 13:09', '2016-01-03 19:22', '2016-01-08 8:44', '2016-01-08 20:22', '2016-01-10 17:36', '2016-01-15 16:56')))
    francis_paypal <- data.table(name = rep('Frequent Francis', 3), purchase_time = as.POSIXct(c('2016-01-03 19:28', '2016-01-08 20:33', '2016-01-10 17:46')))
    Error-prone Erica has mysteriously managed to make a purchase before ever visiting the site!
    erica_website <- data.table(name = rep('Error-prone Erica', 2),
                         session_start_time = as.POSIXct(c('2016-01-04 19:12', '2016-01-04 21:05')))
    erica_paypal <- data.table(name = 'Error-prone Erica', purchase_time = as.POSIXct('2016-01-03 08:02'))
    Visitor Vivian visits our website a couple times, but never makes a purchase (so she appears in the website data, but not in the payment data).
    vivian_website <- data.table(name = rep('Visitor Vivian', 2),
                         session_start_time = as.POSIXct(c('2016-01-01 9:10', '2016-01-09 2:15')))
    vivian_paypal <- erica_paypal[0] # has 0 rows, but the same column names/classes
    And Mom sent money to my PayPal account before my website was up and running (so she appears in the payment data, but not in the website data).
    mom_website <- vivian_website[0] # has 0 rows, but the same column names/classes
    mom_paypal <- data.table(name = 'Mom', purchase_time = as.POSIXct('2015-12-02 17:58'))
    Combine these into two data.tables:
    website <- rbindlist(list(isabel_website, sally_website, francis_website, erica_website, vivian_website, mom_website))
    paypal <- rbindlist(list(isabel_paypal, sally_paypal, francis_paypal, erica_paypal, vivian_paypal, mom_paypal))
    To keep things straight, let's give each website session a unique ID and each payment a unique ID.
    website[, session_id:=.GRP, by = .(name, session_start_time)]
    paypal[, payment_id:=.GRP, by = .(name, purchase_time)]
    Finally, have a look at the data.
    website
    ##                  name  session_start_time session_id
    ##  1: Indecisive Isabel 2016-01-01 11:01:00          1
    ##  2: Indecisive Isabel 2016-01-02 08:59:00          2
    ##  3: Indecisive Isabel 2016-01-05 18:18:00          3
    ##  4: Indecisive Isabel 2016-01-07 19:03:00          4
    ##  5: Indecisive Isabel 2016-01-08 19:01:00          5
    ##  6:      Spendy Sally 2016-01-03 10:00:00          6
    ##  7:  Frequent Francis 2016-01-02 13:09:00          7
    ##  8:  Frequent Francis 2016-01-03 19:22:00          8
    ##  9:  Frequent Francis 2016-01-08 08:44:00          9
    ## 10:  Frequent Francis 2016-01-08 20:22:00         10
    ## 11:  Frequent Francis 2016-01-10 17:36:00         11
    ## 12:  Frequent Francis 2016-01-15 16:56:00         12
    ## 13: Error-prone Erica 2016-01-04 19:12:00         13
    ## 14: Error-prone Erica 2016-01-04 21:05:00         14
    ## 15:    Visitor Vivian 2016-01-01 09:10:00         15
    ## 16:    Visitor Vivian 2016-01-09 02:15:00         16
    paypal
    ##                 name       purchase_time payment_id
    ## 1: Indecisive Isabel 2016-01-08 19:10:00          1
    ## 2:      Spendy Sally 2016-01-03 10:06:00          2
    ## 3:      Spendy Sally 2016-01-03 10:15:00          3
    ## 4:  Frequent Francis 2016-01-03 19:28:00          4
    ## 5:  Frequent Francis 2016-01-08 20:33:00          5
    ## 6:  Frequent Francis 2016-01-10 17:46:00          6
    ## 7: Error-prone Erica 2016-01-03 08:02:00          7
    ## 8:               Mom 2015-12-02 17:58:00          8

    The Joins

    Before doing any rolling joins, I like to create a separate date/time column in each table to join on, because one of the two tables loses its date/time field and I can never remember which.
    website[, join_time:=session_start_time]
    paypal[, join_time:=purchase_time]
    Next, set keys on each table. The last key column is the one the rolling join will “roll” on. We want to first join on name and then within each name, match website sessions to purchases. So we key on name first, then on the newly created join_time.
    setkey(website, name, join_time)
    setkey(paypal, name, join_time)

    Rolling Forward

    Now let’s answer the question “what website session immediately preceded each payment?”
    website[paypal, roll = T] # equivalent to website[paypal, roll = Inf]
    ##                 name  session_start_time session_id           join_time
    ## 1: Error-prone Erica                <NA>         NA 2016-01-03 08:02:00
    ## 2:  Frequent Francis 2016-01-03 19:22:00          8 2016-01-03 19:28:00
    ## 3:  Frequent Francis 2016-01-08 20:22:00         10 2016-01-08 20:33:00
    ## 4:  Frequent Francis 2016-01-10 17:36:00         11 2016-01-10 17:46:00
    ## 5: Indecisive Isabel 2016-01-08 19:01:00          5 2016-01-08 19:10:00
    ## 6:               Mom                <NA>         NA 2015-12-02 17:58:00
    ## 7:      Spendy Sally 2016-01-03 10:00:00          6 2016-01-03 10:06:00
    ## 8:      Spendy Sally 2016-01-03 10:00:00          6 2016-01-03 10:15:00
    ##          purchase_time payment_id
    ## 1: 2016-01-03 08:02:00          7
    ## 2: 2016-01-03 19:28:00          4
    ## 3: 2016-01-08 20:33:00          5
    ## 4: 2016-01-10 17:46:00          6
    ## 5: 2016-01-08 19:10:00          1
    ## 6: 2015-12-02 17:58:00          8
    ## 7: 2016-01-03 10:06:00          2
    ## 8: 2016-01-03 10:15:00          3
    Notice several things about this result:
    • Each payment is matched to the closest preceding website session. all(purchase_time > session_start_time, na.rm = T) evaluates to TRUE.
    • Payments with no preceding sessions still appear in the result (that is, nrow(result) == nrow(paypal)).
    • Visitor Vivian does not appear in the results because she does not appear in the paypal table.
    • Mom’s “purchase” has no website session associated with it because she has never visited the website at all.
    • Error-prone Erica’s mysterious purchase has no website session associated with it because she never visited the website prior to her purchase.
    • Spendy Sally’s one website session is matched to both of her purchases.

    Rolling Backward

    Now let's switch the order of the two tables and answer the question "which sessions led to a purchase?" In this case, we want to match payments to website sessions, so long as the payment occurred after the beginning of the website session.
    paypal[website, roll = -Inf]
    ##                  name       purchase_time payment_id           join_time
    ##  1: Error-prone Erica                <NA>         NA 2016-01-04 19:12:00
    ##  2: Error-prone Erica                <NA>         NA 2016-01-04 21:05:00
    ##  3:  Frequent Francis 2016-01-03 19:28:00          4 2016-01-02 13:09:00
    ##  4:  Frequent Francis 2016-01-03 19:28:00          4 2016-01-03 19:22:00
    ##  5:  Frequent Francis 2016-01-08 20:33:00          5 2016-01-08 08:44:00
    ##  6:  Frequent Francis 2016-01-08 20:33:00          5 2016-01-08 20:22:00
    ##  7:  Frequent Francis 2016-01-10 17:46:00          6 2016-01-10 17:36:00
    ##  8:  Frequent Francis                <NA>         NA 2016-01-15 16:56:00
    ##  9: Indecisive Isabel 2016-01-08 19:10:00          1 2016-01-01 11:01:00
    ## 10: Indecisive Isabel 2016-01-08 19:10:00          1 2016-01-02 08:59:00
    ## 11: Indecisive Isabel 2016-01-08 19:10:00          1 2016-01-05 18:18:00
    ## 12: Indecisive Isabel 2016-01-08 19:10:00          1 2016-01-07 19:03:00
    ## 13: Indecisive Isabel 2016-01-08 19:10:00          1 2016-01-08 19:01:00
    ## 14:      Spendy Sally 2016-01-03 10:06:00          2 2016-01-03 10:00:00
    ## 15:    Visitor Vivian                <NA>         NA 2016-01-01 09:10:00
    ## 16:    Visitor Vivian                <NA>         NA 2016-01-09 02:15:00
    ##      session_start_time session_id
    ##  1: 2016-01-04 19:12:00         13
    ##  2: 2016-01-04 21:05:00         14
    ##  3: 2016-01-02 13:09:00          7
    ##  4: 2016-01-03 19:22:00          8
    ##  5: 2016-01-08 08:44:00          9
    ##  6: 2016-01-08 20:22:00         10
    ##  7: 2016-01-10 17:36:00         11
    ##  8: 2016-01-15 16:56:00         12
    ##  9: 2016-01-01 11:01:00          1
    ## 10: 2016-01-02 08:59:00          2
    ## 11: 2016-01-05 18:18:00          3
    ## 12: 2016-01-07 19:03:00          4
    ## 13: 2016-01-08 19:01:00          5
    ## 14: 2016-01-03 10:00:00          6
    ## 15: 2016-01-01 09:10:00         15
    ## 16: 2016-01-09 02:15:00         16
    In this result
    • Each website session is matched to the nearest following payment. all(purchase_time > session_start_time, na.rm = T) still evaluates to TRUE.
    • Mom does not appear because she has no record in the website table.
    • Visitor Vivian’s sessions are not matched to any purchases because she hasn’t purchased anything.
    • Neither of Erica’s website sessions are matched to her purchase because it took place before both sessions.
    • Frequent Francis’s most recent session isn’t matched to a purchase because she hasn’t made a purchase after that session.
    • All of Indecisive Isabel's sessions are matched to her one purchase. In fact, that single purchase appears in the result five times.

    Rolling Windows

    What if we wanted to add an additional criterion to the rolling join above: match payments to website sessions, so long as the payment occurred after the beginning of the website session and within 12 hours of the website session?
    twelve_hours <- 60*60*12 # 12 hours = 60 sec * 60 min * 12 hours
    paypal[website, roll = -twelve_hours]
    ##                  name       purchase_time payment_id           join_time
    ##  1: Error-prone Erica                <NA>         NA 2016-01-04 19:12:00
    ##  2: Error-prone Erica                <NA>         NA 2016-01-04 21:05:00
    ##  3:  Frequent Francis                <NA>         NA 2016-01-02 13:09:00
    ##  4:  Frequent Francis 2016-01-03 19:28:00          4 2016-01-03 19:22:00
    ##  5:  Frequent Francis 2016-01-08 20:33:00          5 2016-01-08 08:44:00
    ##  6:  Frequent Francis 2016-01-08 20:33:00          5 2016-01-08 20:22:00
    ##  7:  Frequent Francis 2016-01-10 17:46:00          6 2016-01-10 17:36:00
    ##  8:  Frequent Francis                <NA>         NA 2016-01-15 16:56:00
    ##  9: Indecisive Isabel                <NA>         NA 2016-01-01 11:01:00
    ## 10: Indecisive Isabel                <NA>         NA 2016-01-02 08:59:00
    ## 11: Indecisive Isabel                <NA>         NA 2016-01-05 18:18:00
    ## 12: Indecisive Isabel                <NA>         NA 2016-01-07 19:03:00
    ## 13: Indecisive Isabel 2016-01-08 19:10:00          1 2016-01-08 19:01:00
    ## 14:      Spendy Sally 2016-01-03 10:06:00          2 2016-01-03 10:00:00
    ## 15:    Visitor Vivian                <NA>         NA 2016-01-01 09:10:00
    ## 16:    Visitor Vivian                <NA>         NA 2016-01-09 02:15:00
    ##      session_start_time session_id
    ##  1: 2016-01-04 19:12:00         13
    ##  2: 2016-01-04 21:05:00         14
    ##  3: 2016-01-02 13:09:00          7
    ##  4: 2016-01-03 19:22:00          8
    ##  5: 2016-01-08 08:44:00          9
    ##  6: 2016-01-08 20:22:00         10
    ##  7: 2016-01-10 17:36:00         11
    ##  8: 2016-01-15 16:56:00         12
    ##  9: 2016-01-01 11:01:00          1
    ## 10: 2016-01-02 08:59:00          2
    ## 11: 2016-01-05 18:18:00          3
    ## 12: 2016-01-07 19:03:00          4
    ## 13: 2016-01-08 19:01:00          5
    ## 14: 2016-01-03 10:00:00          6
    ## 15: 2016-01-01 09:10:00         15
    ## 16: 2016-01-09 02:15:00         16
    Now only Indecisive Isabel's last session is associated with a purchase.

    The rollends Argument

    Recall the first join from above, matching the preceding website session to each payment.
    website[paypal, roll = T] # equivalent to website[paypal, roll = T, rollends = c(F, T)]
    ##                 name  session_start_time session_id           join_time
    ## 1: Error-prone Erica                <NA>         NA 2016-01-03 08:02:00
    ## 2:  Frequent Francis 2016-01-03 19:22:00          8 2016-01-03 19:28:00
    ## 3:  Frequent Francis 2016-01-08 20:22:00         10 2016-01-08 20:33:00
    ## 4:  Frequent Francis 2016-01-10 17:36:00         11 2016-01-10 17:46:00
    ## 5: Indecisive Isabel 2016-01-08 19:01:00          5 2016-01-08 19:10:00
    ## 6:               Mom                <NA>         NA 2015-12-02 17:58:00
    ## 7:      Spendy Sally 2016-01-03 10:00:00          6 2016-01-03 10:06:00
    ## 8:      Spendy Sally 2016-01-03 10:00:00          6 2016-01-03 10:15:00
    ##          purchase_time payment_id
    ## 1: 2016-01-03 08:02:00          7
    ## 2: 2016-01-03 19:28:00          4
    ## 3: 2016-01-08 20:33:00          5
    ## 4: 2016-01-10 17:46:00          6
    ## 5: 2016-01-08 19:10:00          1
    ## 6: 2015-12-02 17:58:00          8
    ## 7: 2016-01-03 10:06:00          2
    ## 8: 2016-01-03 10:15:00          3
    What if we want the rolling join to handle Error-prone Erica’s case differently? Perhaps in cases like hers, where there is a purchase with no preceding session, we prefer the user’s first website session to be matched to the offending purchase. We can use the rollends argument for this. From the data.table documentation (?data.table),
    rollends[1]=TRUE will roll the first value backwards if the value is before it
    website[paypal, roll = T, rollends = c(T, T)] # equivalent to website[paypal, roll = T, rollends = T]
    ##                 name  session_start_time session_id           join_time
    ## 1: Error-prone Erica 2016-01-04 19:12:00         13 2016-01-03 08:02:00
    ## 2:  Frequent Francis 2016-01-03 19:22:00          8 2016-01-03 19:28:00
    ## 3:  Frequent Francis 2016-01-08 20:22:00         10 2016-01-08 20:33:00
    ## 4:  Frequent Francis 2016-01-10 17:36:00         11 2016-01-10 17:46:00
    ## 5: Indecisive Isabel 2016-01-08 19:01:00          5 2016-01-08 19:10:00
    ## 6:               Mom                <NA>         NA 2015-12-02 17:58:00
    ## 7:      Spendy Sally 2016-01-03 10:00:00          6 2016-01-03 10:06:00
    ## 8:      Spendy Sally 2016-01-03 10:00:00          6 2016-01-03 10:15:00
    ##          purchase_time payment_id
    ## 1: 2016-01-03 08:02:00          7
    ## 2: 2016-01-03 19:28:00          4
    ## 3: 2016-01-08 20:33:00          5
    ## 4: 2016-01-10 17:46:00          6
    ## 5: 2016-01-08 19:10:00          1
    ## 6: 2015-12-02 17:58:00          8
    ## 7: 2016-01-03 10:06:00          2
    ## 8: 2016-01-03 10:15:00          3
    In this result, Erica’s first session is matched to her purchase even though the session took place after her purchase, so all(purchase_time > session_start_time, na.rm = T) no longer evaluates to TRUE. Mom’s “purchase” still has no matching session because Mom does not appear in the website table.
    What if we want to perform the same join as above, but only returning matches for payments with sessions before and after?
    website[paypal, roll = T, rollends = c(F, F)] # equivalent to website[paypal, roll = T, rollends = F]
    ##                 name  session_start_time session_id           join_time
    ## 1: Error-prone Erica                <NA>         NA 2016-01-03 08:02:00
    ## 2:  Frequent Francis 2016-01-03 19:22:00          8 2016-01-03 19:28:00
    ## 3:  Frequent Francis 2016-01-08 20:22:00         10 2016-01-08 20:33:00
    ## 4:  Frequent Francis 2016-01-10 17:36:00         11 2016-01-10 17:46:00
    ## 5: Indecisive Isabel                <NA>         NA 2016-01-08 19:10:00
    ## 6:               Mom                <NA>         NA 2015-12-02 17:58:00
    ## 7:      Spendy Sally                <NA>         NA 2016-01-03 10:06:00
    ## 8:      Spendy Sally                <NA>         NA 2016-01-03 10:15:00
    ##          purchase_time payment_id
    ## 1: 2016-01-03 08:02:00          7
    ## 2: 2016-01-03 19:28:00          4
    ## 3: 2016-01-08 20:33:00          5
    ## 4: 2016-01-10 17:46:00          6
    ## 5: 2016-01-08 19:10:00          1
    ## 6: 2015-12-02 17:58:00          8
    ## 7: 2016-01-03 10:06:00          2
    ## 8: 2016-01-03 10:15:00          3
    In this result, the purchases of Error-prone Erica and Mom are unmatched because they have no preceding sessions, and Spendy Sally’s two purchases are unmatched because they have no following website session.
    Note that when roll is set to a negative number, the meaning of the two rollends elements kind of flip-flops:
    website[paypal, roll = -Inf, rollends = c(F, T)] # default when roll < 0 is rollends = c(T, F)
    ##                 name  session_start_time session_id           join_time
    ## 1: Error-prone Erica                <NA>         NA 2016-01-03 08:02:00
    ## 2:  Frequent Francis 2016-01-08 08:44:00          9 2016-01-03 19:28:00
    ## 3:  Frequent Francis 2016-01-10 17:36:00         11 2016-01-08 20:33:00
    ## 4:  Frequent Francis 2016-01-15 16:56:00         12 2016-01-10 17:46:00
    ## 5: Indecisive Isabel 2016-01-08 19:01:00          5 2016-01-08 19:10:00
    ## 6:               Mom                <NA>         NA 2015-12-02 17:58:00
    ## 7:      Spendy Sally 2016-01-03 10:00:00          6 2016-01-03 10:06:00
    ## 8:      Spendy Sally 2016-01-03 10:00:00          6 2016-01-03 10:15:00
    ##          purchase_time payment_id
    ## 1: 2016-01-03 08:02:00          7
    ## 2: 2016-01-03 19:28:00          4
    ## 3: 2016-01-08 20:33:00          5
    ## 4: 2016-01-10 17:46:00          6
    ## 5: 2016-01-08 19:10:00          1
    ## 6: 2015-12-02 17:58:00          8
    ## 7: 2016-01-03 10:06:00          2
    ## 8: 2016-01-03 10:15:00          3
    In this example,
    • Each payment is matched to the nearest following website session (because of roll = -Inf).
    • Error-prone Erica’s purchase is not matched to the session that follows it because her purchase precedes all of her sessions and rollends[1] = F prevents rolling at that end.
    • Spendy Sally’s purchases are joined to her most recent website session, even though it occurred before her purchases (because of rollends[2] = T).

  3. Introduction

    In this post I share a custom model tuning procedure for optimizing the probability threshold for class imbalanced data. This is done within the excellent caret package framework and is akin to the example on the package website, but that example extends the random forest (or rf) method while I present an extension to the C5.0 method. The post is divided into two parts, Motivation and Code. If you’re savvy to the topic of class imbalanced data and you just want to see the code already, skip ahead to the Code section.

    Motivation

    Recently I was working on a 2-class classification problem where my data suffered from a fairly severe class imbalance (I had a lot more “non-events” than “events”). As far as I know, there are three ways to deal with this:
    1. Use an imbalanced cost matrix for your model.
    2. Over-sample “events”, under-sample “non-events”, or some combination of the two.
    3. Fit a classification model as you normally would, use it to generate the probability of an instance being in the “event” class, and then play with your cutoff point - that is, experiment with raising or lowering the probability of being an “event” at which you draw the line between “event” and “non-event”.
    The first approach is a good one. Using an imbalanced cost matrix makes different errors “cost” different amounts. In my case, with “events” being rare, it makes sense that calling an event a non-event (missing a rare event) should be more expensive than calling a non-event an event (a false positive). Imposing costs this way actually causes your model parameters to adjust accordingly, whereas option 3 above does not. The major problem with this approach is that not all modelling techniques support non-uniform cost matrices. As far as I know, within the caret framework only three model types support cost-sensitive model fitting: C5.0, CART, and SVMs.
    The second approach (over/under sampling) is perfectly fine if you have the data to do it. However, in my case, I had 15% events and 85% non-events. To balance the two classes via under-sampling the non-events I would have to randomly select a subset of the non-events roughly equal in number to the events and discard about 70% of my data. I didn’t have enough data to feel confident doing this. On the other hand, to balance my data via over-sampling the events, I would have to include each observation that was an event 5 to 6 times. Having this many duplicate data points would surely bias my model. There are ways to randomly “fuzz” your over-sampled observations so you’re not just including a bunch of duplicate observations, but for reasons I won’t get into, I didn’t see this as a viable option with my data. Of course you could compromise and mix over-sampling and under-sampling, but I decided against this. There are actually really clever ways of doing this and one should not write it off as a bad or sub-optimal strategy for dealing with class imbalanced data.
    The third approach is a commonly used one. Typically the strategy looks like this:
    • Partition your data into a training/validation set and a test set (the common wisdom is 90% training/validation, 10% test).
    • Estimate model hyperparameters (also called tuning parameters) using some sort of cross-validation with your training/validation data.
    • With hyperparameters picked out, fit a model to the entire training/validation set (without leaving out a fold for cross-validation purposes).
    • Generate class probabilities for the test set using the model from the previous step.
    • Using the predicted class probabilities and known classes of the test data, find the cutoff point that yields the “best” class predictions, defining “best” however you like. Note that common measures of model performance such as area under the ROC curve are independent of the chosen cutoff point, so such measures are not appropriate here. A good measure I’ve seen is the euclidean distance between your model’s sensitivity/specificity and the ideal sensitivity/specificity of 1/1, that is, Performance = sqrt((Sensitivity - 1)^2 + (Specificity - 1)^2). (A sketch of this cutoff search follows this list.)
    • Re-fit the model with the chosen hyperparameters, this time on the entire data set.
    • Use this model to generate class probabilities for new observations and the chosen cutoff point to score the new data as an event or non-event.
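    Here is a rough sketch of that cutoff search (my own illustration, not code from the original analysis; test_probs and test_class are hypothetical vectors holding the predicted probability of “event” and the true class for each test observation):
    cutoffs <- seq(0.05, 0.95, by = 0.05)
    dist_to_perfect <- sapply(cutoffs, function(ct){
      pred <- ifelse(test_probs >= ct, "event", "non-event")        # score at this cutoff
      sens <- mean(pred[test_class == "event"] == "event")          # sensitivity
      spec <- mean(pred[test_class == "non-event"] == "non-event")  # specificity
      sqrt((sens - 1)^2 + (spec - 1)^2)                             # distance to the perfect model
    })
    cutoffs[which.min(dist_to_perfect)]  # the "best" cutoff by this measure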
    There are two major drawbacks to this method. The first is that it “costs” you data when searching for optimal hyperparameters - you’re only using 90% of what’s available. This isn’t a big deal if you’ve got plenty of data. The second drawback is that the cutoff point you choose is biased towards a small portion of your data. You only used 10% of your data (with no cross-validation or anything!) to estimate the “best” cutoff point. There is the additional criticism that you trained your model on unbalanced data, which produces a model that is very good at identifying the prevalent class, but poor at identifying the rare class. A very good blog post I read recently suggests that while this criticism is not unfounded, a model fit to unbalanced data does not under-perform by very much. Choosing a cutoff probability other than 0.5 also corrects the “unbalanced” nature of your model.
    The caret package provides an excellent framework for model fitting/validation, serving as an API of sorts to many different R packages with different modelling procedures. It also allows you to define your own modelling procedure if you wish. There are several examples of this on the caret package website and one of these examples demonstrates adding the probability cutoff point discussed above to the list of a model’s hyperparameters. This is an interesting technique that solves the problem of data “cost” and bias in the selection of a cutoff probability other than the default 0.5. The resulting procedure follows this algorithm:
    define the values of hyperparameters over which you would like to grid search
    define the values of cutoff point over which you would like to search
    for each combination of hyperparameters H_i {
      for each cross-validation fold F_j {
        fit a model M_{i,j} using hyperparameters H_i and all data except F_j
        use model M_{i,j} to predict class probabilities for hold out set F_j
        for each value of cutoff point C_k {
          generate class predictions for hold out set F_j from class probabilities using C_k as the threshold
          evaluate the performance of these predictions
        }
      }
    }
    Assuming you are using ten-fold cross validation repeated five times, following this algorithm will yield 50 estimates of model performance for every hyperparameter/threshold pair, without costing any extra data to estimate the threshold value. From these 50 performance estimates you can calculate a mean and standard error and choose the hyperparameter/threshold pair that yields the best model performance just as you would choose hyperparameters for any model. Not only does this not cost any extra data, but your choice of cutoff point is much less biased than had you estimated it using just one hold out set.
    A reflexive argument against this is “but now I’ve massively increased the size of my grid search!” This is true; if you want to search over 10 values of hyperparameter A, 10 values of hyperparameter B, and 10 values of cutoff point, you now have 10×10×10=1000 parameter combinations to search over instead of just 10×10=100 combinations if you only grid search over the hyperparameters. However, you don’t need to fit a separate model to evaluate each cutoff value. You only need to fit a model for each combination of hyperparameters/cross-validation folds and then generate class probabilities (once) and score the data as “event” or “non-event” using each cutoff probability. So while the grid you must search over expands, the added time cost is just the time it takes to generate some class probabilities and execute a bunch of if-else logic. This is negligible in comparison to the time it takes to fit all those models.

    Code

    The example of creating a custom modelling procedure on the caret package website adds this threshold-optimizing functionality to the random forest method. I used this example to add the functionality to the C5.0 method. For kicks, I will also add the choice to use/not use fuzzy thresholding as a Boolean model hyperparameter.
    To start, let’s grab the code behind the default C5.0 method and inspect it.
    library(caret)
    my_mod <- getModelInfo("C5.0", regex = FALSE)[[1]]
    names(my_mod)
     [1] "label"      "library"    "loop"       "type"       "parameters"
     [6] "grid"       "fit"        "predict"    "prob"       "predictors"
    [11] "varImp"     "tags"       "sort"      
    So there are 13 elements in this list that define exactly what train() does when you pass it the argument method="C5.0". We will need to edit some of these, but not all of them. It’s definitely easier to start from the default C5.0 method code and make changes as needed than to start with an empty list and define each list element from scratch.

    label

    The element label just provides the name of the algorithm used. I suppose we could change this to read "C5.0thresh" or something rude to pester our colleagues, but let’s resist the temptation (oh alright, if you must, my_mod$label <- "JimsPasswordIsHisBirthday").
    my_mod$label
    [1] "C5.0"

    tags

    Similarly, there is no need to modify the model’s tags element, but if you must, my_mod$tags <- c(my_mod$tags, "awesome sauce", "custom model best model", "Jims birthday is 8/16/85")

    library

    The element library provides a character vector of libraries that must be loaded for the modelling code to be run. We aren’t going to add anything that uses a different library, so we don’t need to change this.

    type

    The type element simply specifies whether the modelling procedure is used for classification, regression, or both. C5.0 is strictly a classification algorithm, and we aren’t about to change that.

    parameters

    The parameters element lists the model’s tuning parameters (also called hyperparameters) and describes them. The stock C5.0 method parameters are:
    my_mod$parameters
      parameter     class                 label
    1    trials   numeric # Boosting Iterations
    2     model character            Model Type
    3    winnow   logical                Winnow
    Let’s add a cutoff parameter and a fuzzy parameter. The probability cutoff will be a numeric scalar and fuzzy should be a Boolean indicator of whether or not to use fuzzy thresholding to fit the model.
    Note that the parameter trials is of type numeric even though the number of trials will always take on an integer value. Classes assigned here must be one of numeric, character, or logical.
    my_mod$parameters <- data.frame(
      parameter = c("trials", "model", "winnow", "fuzzy", "cutoff"),
      class = c("numeric", "character", "logical", "logical", "numeric"),
      label = c("# Boosting Iterations", "Model Type", "Winnow", "Fuzzy Thresholding", "Probability Cutoff"))

    grid

    The grid element is a function that will be used to create a grid of hyperparameters to search over if a grid of parameters is not explicitly passed to train() via the tuneGrid argument. Currently the grid function only produces a grid of parameters for trials, model, and winnow. Let’s update this to create a grid that also includes fuzzy and cutoff. Again I’m starting from the existing code and just adding.
    The len argument here is passed from the tuneLength argument of train(), which the documentation defines as “an integer denoting the number of levels for each tuning parameter that should be generated by train.”
    my_mod$grid <- function(x, y, len = NULL){
      # values of "trials" to try
      c5seq <- if(len==1)  1 else  c(1, 10*((2:min(len, 11)) - 1))
      # values of "cutoff" to try
      cutoff_seq <- seq(.01, .99, length=len)
      
      # produce the grid
      expand.grid(trials = c5seq, 
                  model = c("tree", "rules"), 
                  winnow = c(TRUE, FALSE), 
                  fuzzy = c(TRUE, FALSE), 
                  cutoff = cutoff_seq)
    }

    fit

    The fit element should be a function for actually fitting the model. This function calls the C5.0() function (from the C50 package) and returns a fitted model. It should pass hyperparameter values from the param argument to the model fitting function, as well as any other arguments the user may have passed to train() through the ... interface that are intended for the fitting function. I will also have this function throw an error if the target variable has more than two classes since this custom model fitting procedure is only intended for the case where we have two classes.
    Note that this fitting function does not reference the cutoff point at all. The cutoff point is estimated independently of the model fitting.
    my_mod$fit <- function(x, y, wts, param, lev, last, classProbs, ...){ 
      
      # throw an error if there are more than 2 classes in the target variable
      if(length(levels(y)) != 2){
        stop("This only works for 2-class problems!")
      }
      
      # grab any arguments passed from train through ...
      theDots <- list(...)
      
      # if one of the arguments is a C5.0Control object,
      if(any(names(theDots) == "control")){ 
        # change its value of winnow to match the value given in param
        theDots$control$winnow <- param$winnow
        # change its value of fuzzyThreshold to match the value given in param
        theDots$control$fuzzyThreshold <- param$fuzzy
      }else{
      # if a C5.0Control object is not given, create one
        theDots$control <- C5.0Control(winnow = param$winnow,
                                       fuzzyThreshold = param$fuzzy)
      }
      
      # put all of the model fitting arguments into one list
      argList <- list(x = x, y = y, weights = wts, trials = param$trials,
                      rules = param$model == "rules")
      # append anything left in ...
      argList <- c(argList, theDots)
      
      # fit a C5.0 model using the arguments in argList
      do.call("C5.0.default", argList)
    }

    loop

    The loop element is a clever construct that can be used to derive predictions for several models from just one model object. Take a look at the default C5.0 loop function.
    my_mod$loop
    function(grid) {     
                        loop <- ddply(grid, c("model", "winnow"),
                                      function(x) c(trials = max(x$trials)))                 
                        
                        submodels <- vector(mode = "list", length = nrow(loop))
                        for(i in seq(along = loop$trials))
                        {
                          index <- which(grid$model == loop$model[i] & 
                                           grid$winnow == loop$winnow[i])
                          trials <- grid[index, "trials"] 
                          submodels[[i]] <- data.frame(trials = trials[trials != loop$trials[i]])
                        }     
                        list(loop = loop, submodels = submodels)
                      }
    This takes the tuning parameter grid and picks out some combinations of tuning parameters to fit a model for and others to be “sub-models”. More specifically, for every unique combination of “model” and “winnow”, it picks out the largest value of “trials” and fits that model. All other models with the same “model” and “winnow” values but smaller “trials” values become sub-models of this one. This is because if you fit a C5.0 model with trials = 50 (50 boosting iterations), you can use that model to make predictions for trials = 50, or any number smaller than 50. The C50 package allows the user to “strip off” some of the boosting iterations and recover a model with fewer boosting iterations. So instead of fitting a model with 50 boosting iterations, then fitting another model with 40 boosting iterations, and so on, you fit one model with 50 boosting iterations and then use it to generate predictions for all of its sub-models. Note that the model fit with 50 boosting iterations and then stripped back to 40 iterations is exactly the same as the model fit using only 40 boosting iterations. So you save a ton of time by not fitting each model.
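    As a quick illustration of that strip-off behaviour (a sketch of my own, not from the post; the simulated data is only there to have something to fit):
    library(C50)
    library(caret)  # just for twoClassSim()
    set.seed(1)
    dat <- twoClassSim(500)
    
    # fit once with 50 boosting iterations...
    fit50 <- C5.0(x = dat[, names(dat) != "Class"], y = dat$Class, trials = 50)
    
    # ...then reuse that one object as the 40-iteration "sub-model" (no refitting needed;
    # if boosting happened to stop early, C5.0 will say so and use the trials it has)
    p40 <- predict(fit50, newdata = dat[, names(dat) != "Class"], trials = 40)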
    Now here’s the tricky part. The default loop function takes the largest trials value for each model/winnow value pair and makes all smaller trials values sub-models of that model. We want to make sub-models of each of these models containing each cutoff value. Since the caret framework doesn’t allow for sub-sub-models (to the best of my knowledge), we will just make the model with the largest trials and the largest cutoff the “main” model for each model/winnow/fuzzy combination and make all trials/cutoff combinations sub-models of this “main” model.
    my_mod$loop <- function(grid){
      
      # for each model/winnow/fuzzy combination, choose the model with
      # the largest trials and cutoff to be the "main" model
      loop <- ddply(grid, c("model", "winnow", "fuzzy"),
                    function(x) c(trials = max(x$trials), cutoff = max(x$cutoff)))
      
      # create a list with one element for each "main" model
      submodels <- vector(mode = "list", length = nrow(loop))
      
      # for each "main" model, make a grid of all sub-models
      for(i in 1:nrow(loop)){
        index <- which(grid$model == loop$model[i] &
                       grid$winnow == loop$winnow[i] &
                       grid$fuzzy == loop$fuzzy[i] & 
                       (grid$trials < loop$trials[i] | grid$cutoff < loop$cutoff[i]))
        
        submodels[[i]] <- grid[index, c('trials', 'cutoff')]
      }
      
      # return the main models and the sub-models separately
      list(loop = loop, submodels = submodels)
    }

    predict

    The predict element is a function for making class predictions (as opposed to class probabilities) given a model. This also utilizes the sub-models defined by the loop element to make predictions for any sub-models associated with the model object passed to it. There will only be model objects created for the “main” models and never for the sub-models. This is where we save a ton of time. Every sub-model is a model we don’t have to fit. Brilliant!
    my_mod$predict <- function(modelFit, newdata, submodels = NULL){
      
      # generate probability of membership in first class
      class1Prob <- predict(modelFit, newdata, type='prob')[, modelFit$obsLevels[1]]
      
      # now assign observations to a class using the cutoff value provided
      out <- ifelse(class1Prob >= modelFit$tuneValue$cutoff,
                    modelFit$obsLevels[1],
                    modelFit$obsLevels[2])
      
      # now make predictions for any sub-models
      if(!is.null(submodels)){
        tmp <- out # hold this for a second plz
        # create a list with an element for each sub-model plus one for the main model
        out <- vector(mode = "list", length = nrow(submodels) + 1)
        out[[1]] <- tmp # the first element is the main model
        
        for(j in 1:nrow(submodels)){
          class1Prob_again <- predict(modelFit, newdata, type='prob', trials = submodels$trials[j])[, modelFit$obsLevels[1]]
          out[[j+1]] <- ifelse(class1Prob_again >= submodels$cutoff[j],
                               modelFit$obsLevels[1],
                               modelFit$obsLevels[2])
        }
      }
      out
    }

    prob

    The prob element is a lot like the predict element, only it returns class probabilities instead of event/non-event predictions. These probabilities are independent of the cutoff point, so we don’t need to change anything in prob to accommodate that. It also takes a model object as an input, which contains the information about the added fuzzy parameter, so we don’t need to change anything to accommodate that either. So we can leave prob as is.
    We also don’t need to change sort, levels, predictors, or varImp. If you’re looking for an example of modifying those, I’ll direct you to the caret page on customizing model tuning.
    Now let’s take this model for a test drive. I’ll generate some data with a rare class and a prevalent class, using caret’s twoClassSim() function.
    set.seed(442)
    trainingSet <- twoClassSim(n = 500, intercept = 7)
    testingSet  <- twoClassSim(n = 500, intercept = 7)
    Class1 is the rare class and Class2 is the prevalent class.
    table(trainingSet$Class)
    
    Class1 Class2 
        48    452 
    I’ll need to define a performance measure to decide which combination of hyperparameters is best. I’ll go with the performance measure mentioned earlier - the euclidean distance to the perfect model with sensitivity = 1 and specificity = 1. (This is taken from the caret website.)
    fourStats <- function(data, lev = levels(data$obs), model = NULL){
    
      out <- c(twoClassSummary(data, lev = levels(data$obs), model = NULL))
    
      # The best possible model has sensitivity of 1 and specificity of 1. 
      # How far are we from that value?
      coords <- matrix(c(1, 1, out["Spec"], out["Sens"]), ncol = 2, byrow = TRUE)
      colnames(coords) <- c("Spec", "Sens")
      rownames(coords) <- c("Best", "Current")
      c(out, Dist = dist(coords)[1])
    }
    Next I’ll define a grid of hyperparameters to search over. I set winnow = FALSE because winnowing is best for situations with small n and large p. In my experience a model with winnowing is never better than one without if you have n>p. I also limit trials to a maximum of 40. This is just for the sake of speed. Since the custom predict element thresholds the probability of membership in the first class and we know that Class1 is the rare class, I expect the optimal cutoff point will be somewhere below the default value of 0.5. Since it costs almost nothing to evaluate multiple cutoff points, I consider 21 different possible cutoff point values.
    mygrid <- expand.grid(trials=c(1, 1:4*10), 
                          model=c('rules', 'tree'), 
                          winnow=FALSE, 
                          fuzzy=c(TRUE, FALSE), 
                          cutoff=c(0.01, seq(0.025, 0.5, by=0.025)))
    Next I define the resampling scheme for my model fitting. 10-fold cross-validation, repeated 5 times seems pretty thorough. I also supply the summary function fourStats and set classProbs = TRUE so that fourStats() can compute the model ROC.
    mycontrol <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 5,
                              classProbs = TRUE,
                              summaryFunction = fourStats)
    Now let’s finally put the custom model to work.
    library(plyr)
    library(C50)
    
    set.seed(949)
    mod1 <- train(Class ~ ., data = trainingSet,
                  method = my_mod,
                  ## Minimize the distance to the perfect model
                  metric = "Dist",
                  maximize = FALSE,
                  tuneGrid=mygrid,
                  trControl = mycontrol)
    The fact that this ran without an error is encouraging. It’s always prudent to inspect the results to make sure they make sense.
    ggplot.train(mod1)
    As expected, cutoff points below 0.5 perform best and models with more boosting iterations perform better. What were the optimal model hyperparameters? How good was the model?
    best_ind <- best(x=mod1$results, metric='Dist', maximize=FALSE)
    mod1$results[best_ind, ]
       model winnow fuzzy trials cutoff       ROC  Sens      Spec      Dist
    92 rules  FALSE FALSE     40  0.175 0.9330292 0.885 0.8548696 0.2135743
            ROCSD    SensSD     SpecSD    DistSD
    92 0.06782202 0.1433029 0.04790972 0.1061781
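    Since the custom predict element applies the tuned cutoff automatically, a quick sanity check of the final model on the held-out testingSet might look like this (my addition, not part of the original analysis):
    test_preds <- predict(mod1, newdata = testingSet)  # class predictions at the tuned cutoff
    confusionMatrix(test_preds, testingSet$Class)      # sensitivity/specificity on unseen data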
    I plan on writing a follow up post to this one in which I will compare this model to several other methods of classifying unbalanced data. But for now, this code works. Whether or not it’s useful remains to be seen.

  4. As I’m sure many statisticians do, I keep a folder of “stock code”, or template scripts that do different things. This folder is always growing and the scripts are always improving, but there are a few in there that I think are worth sharing. Some of these are templates for common analyses, while others are just reminders of how to use a couple of commands to accomplish a practical task.
    This post is of the latter type. I’m going to discuss fetching data from a URL.
    Why might one need to fetch data from a URL?
    • You want to share your code with someone who isn’t familiar with R and you want to avoid the inevitable explanation of how to change the file path at the beginning of the file. (“Make sure you only use forward slashes!”)
    • The data at the URL is constantly changing and you want your analysis to use the latest each time you run it.
    • You want the code to just work when it’s run from another machine with another directory tree.
    • You want to post a completely repeatable analysis on your blog and you don’t want it to begin with “go to www.blahblahblah.com, download this data, and load it into R”.
    Whatever your reason may be, it’s a neat trick, but it’s not one I use so often that I can just rattle off the code for it from memory. So here’s my template. I hope it can help someone else.

    Caveat!!!

    This is only for data that is in tabular form already. This is not for web scraping (i.e. extracting a table of data from a Wikipedia page.) There are entire packages devoted to that. This is for the simplest of all cases where there is a .csv file or a .txt file (or similar) at a URL and you want to read it into R directly from that URL without the intermediate step of saving it somewhere on your computer.

    Using data.table’s fread()

    I love the data.table package. I use it every day, for almost every project I do. It’s an extension of the data.frame object class in R that makes many improvements. One of those improvements is in the function fread(). It’s data.table’s answer to base R’s read.csv(). It does many things better, but here I’m only going to address its ability to read data right from the web. As a primer, its typical use on a data file residing on your computer would look something like this:
    library(data.table)
    mydat <- fread('C:/Some/File/Path.csv')
    Reading data from a source on the web is no different. The example the package authors give in the help file (?fread) is this:
    library(data.table)
    mydat <- fread('http://www.stats.ox.ac.uk/pub/datasets/csb/ch11b.dat')
    head(mydat)
       V1  V2   V3    V4 V5
    1:  1 307  930 36.58  0
    2:  2 307  940 36.73  0
    3:  3 307  950 36.93  0
    4:  4 307 1000 37.15  0
    5:  5 307 1010 37.23  0
    6:  6 307 1020 37.24  0
    Now if you actually navigate to that link in your browser, you won’t see anything, but a download dialog should pop up. If you navigate to the parent directory of that address, http://www.stats.ox.ac.uk/pub/datasets/csb, you will see some explanatory text and, further down the page, several links to data files. Each of these links launches a download dialog when clicked. To grab the URL of the data file to pass to fread(), right click the link and select “Copy link address”. Other data files online might appear in the browser instead of launching a download dialog, like this one a professor of mine had us use for an assignment. fread() handles these URLs just the same.
    fread() makes smart decisions about how to read the data in (it detects column names, column classes, and so on), but the command has several arguments for specifying such things explicitly that you can use at your own discretion. I find fread('filename') almost always just works, but sometimes there are reasons to be more explicit when reading data in.
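    For example, when fread()'s guesses need overriding, a call might look something like this (the URL, column name, and values here are made up purely to show the arguments):
    library(data.table)
    mydat <- fread('http://www.example.com/somefile.csv',
                   sep = ",",                         # don't let fread guess the separator
                   header = TRUE,                     # the first row holds column names
                   colClasses = c(id = "character"),  # force a hypothetical id column to character
                   na.strings = c("", "NA", "."))     # treat these values as missing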

    Using RStudio

    If you’re not familiar with RStudio, you are a true R novice. If you know what it is, but don’t use it, skip ahead.
    In RStudio, you can click “Tools” -> “Import Dataset” -> “From Web URL” and a dialog will pop up asking you for a URL. Paste a URL into the dialog box (let’s just use the same one as before: http://www.stats.ox.ac.uk/pub/datasets/csb/ch11b.dat) and click “OK”. A nice little window pops up and allows you to specify how the data should be read and what name the object should be given in R. When you click “Import”, the data is read in and some code appears in the console. What this interface does is download the data to a temporary file in a temporary folder and then read it in. The downloaded data file persists on your hard drive as long as your R session lasts, but disappears as soon as your R session ends.
    This is handy, but if you wanted to repeat the process, you would have to click through the menu again and supply the data URL again. This isn’t exactly “repeatable” in the Stack Overflow sense of the word.
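    If you want that same download-to-a-temp-file behaviour in a script you can re-run, base R does it in a few lines (a sketch using the same URL as before):
    # download to a temporary file, then read it like any local file;
    # the temp file disappears when the R session ends
    tmp <- tempfile(fileext = ".dat")
    download.file('http://www.stats.ox.ac.uk/pub/datasets/csb/ch11b.dat', tmp)
    mydat <- read.table(tmp)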

    Using RCurl’s getURL()

    The RCurl package provides bindings to the cURL library. This is a C library for web connections. The cURL library does way more than we need for this task and frankly, I don’t understand a lot of it. I saved RCurl for last because I usually try fread() first, and then if I get some sort of error, I resort to RCurl. Take for example the data set at this link: https://sakai.unc.edu/access/content/group/3d1eb92e-7848-4f55-90c3-7c72a54e7e43/public/data/bycatch.csv (also posted by a professor for an assignment of mine). If you try to fread() it, no dice. I have no idea what that error message means, but here’s how to get that data set in anyway.
    library(RCurl)
    myfile <- getURL('https://sakai.unc.edu/access/content/group/3d1eb92e-7848-4f55-90c3-7c72a54e7e43/public/data/bycatch.csv', ssl.verifyhost=FALSE, ssl.verifypeer=FALSE)
    What are the arguments ssl.verifyhost=F and ssl.verifypeer=F doing? To be quite honest, I don’t really know. But if I’m having trouble reading from a URL, I try specifying these arguments; setting one or both to FALSE almost always circumvents whatever error I’m getting.
    This grabs the content residing at the specified URL, but doesn’t return a data.frame object. It has simply put the URL’s content into a string.
    class(myfile)
    [1] "character"
    So how to get this into a data.frame object? We’ll use textConnection() to open a “connection” with the string, much like you would open a connection with a file on your hard drive in order to read it. Then we’ll use read.csv() (or you could use read.table() or fread() or similar) to read the string object like a text file and create a data.frame object.
    mydat <- read.csv(textConnection(myfile), header=T)
    head(mydat)
       Season  Area Gear.Type  Time Tows Bycatch
    1 1989-90 North    Bottom   Day   48       0
    2 1989-90 North    Bottom Night    6       0
    3 1989-90 North Mid-Water Night    1       0
    4 1989-90 South    Bottom   Day  139       0
    5 1989-90 South Mid-Water   Day    6       0
    6 1989-90 South    Bottom Night    6       0
    And there you have it. The data from the URL is now in a data.frame and ready to go.
    Aside: read.csv() is just a version of read.table() with argument defaults such as sep = "," that make sense for reading .csv files.
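    In other words, for this file the read.csv() call above could equally have been written as:
    mydat <- read.table(textConnection(myfile), header = TRUE, sep = ",")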

    A Use Case

    Let’s pretend I want to automate something having to do with weather in Chicago. Maybe it’s a knitr document that I have scheduled to re-knit every night on my server. Every time the script re-runs, it should somehow take into account recent weather in Chicago. Weather Underground offers historic (and an hour ago counts as “historic”) hourly weather data for many different locations. Many of these locations are airports, which for obvious reasons, have several meteorological sensors on site. On the Weather Underground page you can select a location and a date and see hourly weather for that calendar day. At the bottom of the page, you can click “Comma Delimited File” to see the data in comma delimited format - perfect for reading into R.
    I see that the four letter airport code for Chicago is “KMDW” and after clicking through a few of these URLs, I see the stuff after “DailyHistory.html” doesn’t change. So if I know the date, I can construct the URL where the hourly Chicago airport weather for that date can be found in .csv format.
    First, I define the beginning and end of the URL, which never change.
    baseURL <- 'http://www.wunderground.com/history/airport/KMDW'
    suffixURL <- 'DailyHistory.html?HideSpecis=1&format=1'
    There is opportunity here to generalize this for many locations if one simply maps the four letter codes to other locations of interest using switch() or similar.
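    Something like this hypothetical helper is what I have in mind (the non-Chicago code is just for illustration; check any code against Weather Underground before relying on it):
    # hypothetical helper: look up the four letter airport code for a location
    airport_code <- function(city){
      switch(city,
             'Chicago'     = 'KMDW',
             'Minneapolis' = 'KMSP',
             stop('No airport code stored for this city'))
    }
    baseURL <- paste('http://www.wunderground.com/history/airport', airport_code('Chicago'), sep = '/')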
    Then I ask the system for today’s date and from it produce a string having format year/month/day.
    Date <- Sys.Date()
    datestring <- format(Date, '%Y/%m/%d')
    Then I piece all of these strings together to get a URL which will lead to a .csv file of today’s weather in Chicago.
    url2fetch <- paste(baseURL, datestring, suffixURL, sep='/')
    Finally I grab the content of the webpage at that URL using the RCurl method described above. I choose getURL() instead of fread() for good reason; I’ll need to do some find-and-replace to clean up some html artifacts in the data and that is more efficient to do on one big string rather than on a bunch of individual values in a data.frame.
    url_content <- getURL(url2fetch)
    Now I have the content of the page in a string and I want to read that string into a data.frame object, but every line of the data ends with an html newline (“<br />”) and a text newline (“\n”). read.csv() will recognize the “\n” as a signal to start a new row of the data.frame, but the “<br />” isn’t recognized and will be appended to the value in the last column of every row. So let’s take care of this before read.csv() ever gets involved. I’ll do a simple find-and-replace where I find “<br />” and replace it with an empty string (""), aka nothing. This is the regex way of find-and-delete.
    url_content <- gsub('<br />', '', url_content)
    Finally I can “read” the data into a data.frame object with the help of read.csv() and textConnection().
    weather_data <- read.csv(textConnection(url_content))
    head(weather_data)
       TimeCST TemperatureF Dew.PointF Humidity Sea.Level.PressureIn
    1 12:22 AM         21.9       17.1       82                30.02
    2 12:53 AM         21.9       16.0       78                30.07
    3  1:53 AM         21.9       15.1       75                30.09
    4  2:24 AM         21.0       14.0       74                30.04
    5  2:39 AM         21.0       14.0       74                30.04
    6  2:53 AM         21.0       15.1       78                30.09
      VisibilityMPH Wind.Direction Wind.SpeedMPH Gust.SpeedMPH PrecipitationIn
    1           1.0            NNE          13.8             -            0.01
    2           1.0            NNE          15.0             -            0.01
    3           4.0            NNE          11.5             -            0.00
    4           2.5            NNE          16.1             -            0.00
    5           1.5            NNE          12.7             -            0.00
    6           1.8            NNE          12.7             -            0.00
      Events Conditions WindDirDegrees             DateUTC
    1   Snow       Snow             30 2015-02-26 06:22:00
    2   Snow Light Snow             30 2015-02-26 06:53:00
    3   Snow Light Snow             30 2015-02-26 07:53:00
    4   Snow Light Snow             30 2015-02-26 08:24:00
    5   Snow Light Snow             30 2015-02-26 08:39:00
    6   Snow Light Snow             30 2015-02-26 08:53:00
    Glorious.

  5. Generating Tables Using Pander, knitr, and Rmarkdown

    I use a pretty common workflow (I think) for producing reports on a day to day basis. I write them in rmarkdown using RStudio, knit them into .html and .md documents using knitr, then convert the resulting .md file to a .docx file using pander, which is really just a way of communicating with Pandoc via my R terminal. This workflow is great for many reasons that I won't get into, but one major shortcoming is how difficult it is to get a nice looking table out of it. I was working on a report today in which I am cleaning some data for my boss. I'll quickly replicate this for effect.
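    That last .md-to-.docx step is a one-liner with pander, along these lines (a sketch; it assumes Pandoc is installed and that the knitted file is called report.md, which is a made-up name):
    library(pander)
    # hand the markdown file produced by knitr to Pandoc and ask for a Word document
    Pandoc.convert('report.md', format = 'docx')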

    I first plot the data, then notice several outliers that we should trace back to the database and make sure that the values reported are indeed correct and not a typo etc.
    rm(list = ls())
    library(pander)
    
    ## Attaching package: 'pander'
    
    ## The following object is masked from 'package:knitr':
    ## 
    ## pandoc
    
    library(ggplot2)
    mydat <- diamonds
    
    my.plot <- ggplot(mydat, aes(x = cut(carat, 0:6), y = log(price))) + facet_grid(. ~ 
        color, labeller = label_both) + geom_boxplot(fill = "grey", lwd = 0.9)
    print(my.plot)
    
    (plot: boxplots of log(price) by carat bin, faceted by color)
    I see what might be a couple outliers here. (This is actually a canonical and therefore very clean dataset, but just bear with me.)
    mydat$Outlier <- F
    mydat$Outlier[mydat$color == "J" & mydat$carat > 3 & mydat$carat < 4 & mydat$price < 
        exp(9.5)] <- T
    mydat$Outlier[mydat$color == "I" & mydat$carat > 3 & mydat$carat < 4 & mydat$price < 
        exp(9.1)] <- T
    mydat$Outlier[mydat$color == "I" & mydat$carat > 2 & mydat$carat < 3 & mydat$price < 
        exp(8.7)] <- T
    mydat$Outlier[mydat$color == "H" & mydat$carat > 3 & mydat$carat < 4 & mydat$price > 
        exp(9.5)] <- T
    mydat$Outlier[mydat$color == "E" & mydat$carat > 1 & mydat$carat < 2 & mydat$price < 
        exp(7.5)] <- T
    
    print(my.plot + geom_point(data = subset(mydat, Outlier == T), color = "red", 
        size = 4))
    
    (plot: the same boxplots with the flagged outliers highlighted in red)
    I might want to send these points back to the lab or whoever's in charge of data collection so that they can verify that they're good observations and not typos etc. This is where I'd like to put a good table into my report, with details about these perceived “outliers” so that these points can be checked. It would be easy enough to track this stuff down myself and put it in an email or something, but for argument's sake, I want to automate this report so that every time I get an updated database I can spit one of these guys out without any actual work.
    I could simply do this:
    subset(mydat, Outlier == T)
    
    ##       carat     cut color clarity depth table price    x    y    z Outlier
    ## 13992  2.01    Fair     I      I1  67.4    58  5696 7.71 7.64 5.17    TRUE
    ## 13993  2.01    Fair     I      I1  55.9    64  5696 8.48 8.39 4.71    TRUE
    ## 19340  3.01 Premium     I      I1  62.7    58  8040 9.10 8.97 5.67    TRUE
    ## 21759  3.11    Fair     J      I1  65.9    57  9823 9.15 9.02 5.98    TRUE
    ## 27650  3.01    Good     H     SI2  57.6    64 18593 9.44 9.38 5.42    TRUE
    ## 41919  1.03    Fair     E      I1  78.2    54  1262 5.72 5.59 4.42    TRUE
    
    But that's ugly. I especially hate the row names and the hashes. The hashes are good when you're writing something you expect to be copy and pasted into an R terminal, so that whoever's doing the copy and pasting gets the code copied in and evaluated, but the results don't get evaluated because they are commented out. In my reports I'm actually hiding all of the code and I'd like my results to look a little neater.
    To get rid of the rownames - slightly more verbose, but also slightly neater:
    print(subset(mydat, Outlier == T), row.names = F)
    
    ##  carat     cut color clarity depth table price    x    y    z Outlier
    ##   2.01    Fair     I      I1  67.4    58  5696 7.71 7.64 5.17    TRUE
    ##   2.01    Fair     I      I1  55.9    64  5696 8.48 8.39 4.71    TRUE
    ##   3.01 Premium     I      I1  62.7    58  8040 9.10 8.97 5.67    TRUE
    ##   3.11    Fair     J      I1  65.9    57  9823 9.15 9.02 5.98    TRUE
    ##   3.01    Good     H     SI2  57.6    64 18593 9.44 9.38 5.42    TRUE
    ##   1.03    Fair     E      I1  78.2    54  1262 5.72 5.59 4.42    TRUE
    
    And to get rid of the hashes:
    opts_chunk$set(comment = NA)  # This makes the change globally for every chunk after this one
    
    print(subset(mydat, Outlier == T), row.names = F)
    
     carat     cut color clarity depth table price    x    y    z Outlier
      2.01    Fair     I      I1  67.4    58  5696 7.71 7.64 5.17    TRUE
      2.01    Fair     I      I1  55.9    64  5696 8.48 8.39 4.71    TRUE
      3.01 Premium     I      I1  62.7    58  8040 9.10 8.97 5.67    TRUE
      3.11    Fair     J      I1  65.9    57  9823 9.15 9.02 5.98    TRUE
      3.01    Good     H     SI2  57.6    64 18593 9.44 9.38 5.42    TRUE
      1.03    Fair     E      I1  78.2    54  1262 5.72 5.59 4.42    TRUE
    
    This looks better certainly, but still doesn't look like a table. I can get a little fancier by calling the pander() function in the pander package, which applies formatting to my table for me:
    pander(subset(mydat, Outlier == T))
    
    
    -----------------------------------------------------
      &nbsp;     carat    cut    color   clarity   depth 
    ----------- ------- ------- ------- --------- -------
     **13992**   2.01    Fair      I       I1      67.4  
    
     **13993**   2.01    Fair      I       I1      55.9  
    
     **19340**   3.01   Premium    I       I1      62.7  
    
     **21759**   3.11    Fair      J       I1      65.9  
    
     **27650**   3.01    Good      H       SI2     57.6  
    
     **41919**   1.03    Fair      E       I1      78.2  
    -----------------------------------------------------
    
    Table: Table continues below
    
    
    ----------------------------------------------------
      &nbsp;     table   price   x    y    z    Outlier 
    ----------- ------- ------- ---- ---- ---- ---------
     **13992**    58     5696   7.71 7.64 5.17   TRUE   
    
     **13993**    64     5696   8.48 8.39 4.71   TRUE   
    
     **19340**    58     8040   9.1  8.97 5.67   TRUE   
    
     **21759**    57     9823   9.15 9.02 5.98   TRUE   
    
     **27650**    64     18593  9.44 9.38 5.42   TRUE   
    
     **41919**    54     1262   5.72 5.59 4.42   TRUE   
    ----------------------------------------------------
    
    But the rownames are back. Ugh. Frustratingly, this is the best solution I've found to this so far:
    temp <- subset(mydat, Outlier == T)
    rownames(temp) <- NULL
    pander(temp)
    
    
    ---------------------------------------------------------
     carat    cut    color   clarity   depth   table   price 
    ------- ------- ------- --------- ------- ------- -------
     2.01    Fair      I       I1      67.4     58     5696  
    
     2.01    Fair      I       I1      55.9     64     5696  
    
     3.01   Premium    I       I1      62.7     58     8040  
    
     3.11    Fair      J       I1      65.9     57     9823  
    
     3.01    Good      H       SI2     57.6     64     18593 
    
     1.03    Fair      E       I1      78.2     54     1262  
    ---------------------------------------------------------
    
    Table: Table continues below
    
    
    ------------------------
     x    y    z    Outlier 
    ---- ---- ---- ---------
    7.71 7.64 5.17   TRUE   
    
    8.48 8.39 4.71   TRUE   
    
    9.1  8.97 5.67   TRUE   
    
    9.15 9.02 5.98   TRUE   
    
    9.44 9.38 5.42   TRUE   
    
    5.72 5.59 4.42   TRUE   
    ------------------------
    
    I also don't really care for the way the table breaks off and then continues underneath. To change this, I must specify the option table.split.table in the pander package. According to the pander documentation
    This option tells pander where to split too wide tables. The default value (80) suggests the conventional number of characters used in a line, feel free to change (e.g. to Inf to disable this feature) if you are not using a VT100 terminal any more :)
    Don't mind if I do.
    panderOptions("table.split.table", Inf) 
    pander(temp)
    
    
    ----------------------------------------------------------------------------------
     carat    cut    color   clarity   depth   table   price   x    y    z    Outlier 
    ------- ------- ------- --------- ------- ------- ------- ---- ---- ---- ---------
     2.01    Fair      I       I1      67.4     58     5696   7.71 7.64 5.17   TRUE   
    
     2.01    Fair      I       I1      55.9     64     5696   8.48 8.39 4.71   TRUE   
    
     3.01   Premium    I       I1      62.7     58     8040   9.1  8.97 5.67   TRUE   
    
     3.11    Fair      J       I1      65.9     57     9823   9.15 9.02 5.98   TRUE   
    
     3.01    Good      H       SI2     57.6     64     18593  9.44 9.38 5.42   TRUE   
    
     1.03    Fair      E       I1      78.2     54     1262   5.72 5.59 4.42   TRUE   
    ----------------------------------------------------------------------------------
    
    If I set table.split.table to Inf however, when I convert the .md file to a .docx file the table will run over onto the next line in Word and it will look awful like the one below.
    (screenshot: a too-wide table spilling onto the following lines in Word)
    So I set the table.split.table to 105 instead, which seems to be about as much as Word can handle on one line given the font and so forth that I'm using.
    panderOptions("table.split.table", 105)
    
    Getting there, but can we do better? The pander() function has a number of options for displaying tables. You can set the table style within the call to pander() (which is just a wrapper for pandoc.table()), but I prefer to set these options globally so that all of my tables will look the same when I generate the report:
    panderOptions("table.style", "multiline")  # The default
    pander(temp)
    
    
    ------------------------------------------------------------------------
     carat    cut    color   clarity   depth   table   price   x    y    z  
    ------- ------- ------- --------- ------- ------- ------- ---- ---- ----
     2.01    Fair      I       I1      67.4     58     5696   7.71 7.64 5.17
    
     2.01    Fair      I       I1      55.9     64     5696   8.48 8.39 4.71
    
     3.01   Premium    I       I1      62.7     58     8040   9.1  8.97 5.67
    
     3.11    Fair      J       I1      65.9     57     9823   9.15 9.02 5.98
    
     3.01    Good      H       SI2     57.6     64     18593  9.44 9.38 5.42
    
     1.03    Fair      E       I1      78.2     54     1262   5.72 5.59 4.42
    ------------------------------------------------------------------------
    
    Table: Table continues below
    
    
    ---------
     Outlier 
    ---------
      TRUE   
    
      TRUE   
    
      TRUE   
    
      TRUE   
    
      TRUE   
    
      TRUE   
    ---------
    
    panderOptions("table.style", "simple")
    pander(temp)
    
    
    
     carat    cut    color   clarity   depth   table   price   x    y    z  
    ------- ------- ------- --------- ------- ------- ------- ---- ---- ----
     2.01    Fair      I       I1      67.4     58     5696   7.71 7.64 5.17
     2.01    Fair      I       I1      55.9     64     5696   8.48 8.39 4.71
     3.01   Premium    I       I1      62.7     58     8040   9.1  8.97 5.67
     3.11    Fair      J       I1      65.9     57     9823   9.15 9.02 5.98
     3.01    Good      H       SI2     57.6     64     18593  9.44 9.38 5.42
     1.03    Fair      E       I1      78.2     54     1262   5.72 5.59 4.42
    
    Table: Table continues below
    
    
    
     Outlier 
    ---------
      TRUE   
      TRUE   
      TRUE   
      TRUE   
      TRUE   
      TRUE   
    
    panderOptions("table.style", "grid")
    pander(temp)
    
    
    
    +---------+---------+---------+-----------+---------+---------+---------+------+------+------+
    |  carat  |   cut   |  color  |  clarity  |  depth  |  table  |  price  |  x   |  y   |  z   |
    +=========+=========+=========+===========+=========+=========+=========+======+======+======+
    |  2.01   |  Fair   |    I    |    I1     |  67.4   |   58    |  5696   | 7.71 | 7.64 | 5.17 |
    +---------+---------+---------+-----------+---------+---------+---------+------+------+------+
    |  2.01   |  Fair   |    I    |    I1     |  55.9   |   64    |  5696   | 8.48 | 8.39 | 4.71 |
    +---------+---------+---------+-----------+---------+---------+---------+------+------+------+
    |  3.01   | Premium |    I    |    I1     |  62.7   |   58    |  8040   | 9.1  | 8.97 | 5.67 |
    +---------+---------+---------+-----------+---------+---------+---------+------+------+------+
    |  3.11   |  Fair   |    J    |    I1     |  65.9   |   57    |  9823   | 9.15 | 9.02 | 5.98 |
    +---------+---------+---------+-----------+---------+---------+---------+------+------+------+
    |  3.01   |  Good   |    H    |    SI2    |  57.6   |   64    |  18593  | 9.44 | 9.38 | 5.42 |
    +---------+---------+---------+-----------+---------+---------+---------+------+------+------+
    |  1.03   |  Fair   |    E    |    I1     |  78.2   |   54    |  1262   | 5.72 | 5.59 | 4.42 |
    +---------+---------+---------+-----------+---------+---------+---------+------+------+------+
    
    Table: Table continues below
    
    
    
    +-----------+
    |  Outlier  |
    +===========+
    |   TRUE    |
    +-----------+
    |   TRUE    |
    +-----------+
    |   TRUE    |
    +-----------+
    |   TRUE    |
    +-----------+
    |   TRUE    |
    +-----------+
    |   TRUE    |
    +-----------+
    
    panderOptions("table.style", "rmarkdown")
    pander(temp)
    
    
    
    |  carat  |   cut   |  color  |  clarity  |  depth  |  table  |  price  |  x   |  y   |  z   |
    |:-------:|:-------:|:-------:|:---------:|:-------:|:-------:|:-------:|:----:|:----:|:----:|
    |  2.01   |  Fair   |    I    |    I1     |  67.4   |   58    |  5696   | 7.71 | 7.64 | 5.17 |
    |  2.01   |  Fair   |    I    |    I1     |  55.9   |   64    |  5696   | 8.48 | 8.39 | 4.71 |
    |  3.01   | Premium |    I    |    I1     |  62.7   |   58    |  8040   | 9.1  | 8.97 | 5.67 |
    |  3.11   |  Fair   |    J    |    I1     |  65.9   |   57    |  9823   | 9.15 | 9.02 | 5.98 |
    |  3.01   |  Good   |    H    |    SI2    |  57.6   |   64    |  18593  | 9.44 | 9.38 | 5.42 |
    |  1.03   |  Fair   |    E    |    I1     |  78.2   |   54    |  1262   | 5.72 | 5.59 | 4.42 |
    
    Table: Table continues below
    
    
    
    |  Outlier  |
    |:---------:|
    |   TRUE    |
    |   TRUE    |
    |   TRUE    |
    |   TRUE    |
    |   TRUE    |
    |   TRUE    |
    
Now perhaps you've heard of the package xtable. It's a nice package for making .html tables in your documents. I've used it before, so I'm just copying and pasting a couple of lines from a report I wrote a few months ago:
    library(xtable)
    print(xtable(temp, align = rep("c", dim(temp)[2] + 1)), type = "html")
    
    <!-- html table generated in R 3.0.1 by xtable 1.7-1 package -->
    <!-- Wed Jun 19 10:40:22 2013 -->
    <TABLE border=1>
    <TR> <TH>  </TH> <TH> carat </TH> <TH> cut </TH> <TH> color </TH> <TH> clarity </TH> <TH> depth </TH> <TH> table </TH> <TH> price </TH> <TH> x </TH> <TH> y </TH> <TH> z </TH> <TH> Outlier </TH>  </TR>
      <TR> <TD align="center"> 1 </TD> <TD align="center"> 2.01 </TD> <TD align="center"> Fair </TD> <TD align="center"> I </TD> <TD align="center"> I1 </TD> <TD align="center"> 67.40 </TD> <TD align="center"> 58.00 </TD> <TD align="center"> 5696 </TD> <TD align="center"> 7.71 </TD> <TD align="center"> 7.64 </TD> <TD align="center"> 5.17 </TD> <TD align="center"> TRUE </TD> </TR>
      <TR> <TD align="center"> 2 </TD> <TD align="center"> 2.01 </TD> <TD align="center"> Fair </TD> <TD align="center"> I </TD> <TD align="center"> I1 </TD> <TD align="center"> 55.90 </TD> <TD align="center"> 64.00 </TD> <TD align="center"> 5696 </TD> <TD align="center"> 8.48 </TD> <TD align="center"> 8.39 </TD> <TD align="center"> 4.71 </TD> <TD align="center"> TRUE </TD> </TR>
      <TR> <TD align="center"> 3 </TD> <TD align="center"> 3.01 </TD> <TD align="center"> Premium </TD> <TD align="center"> I </TD> <TD align="center"> I1 </TD> <TD align="center"> 62.70 </TD> <TD align="center"> 58.00 </TD> <TD align="center"> 8040 </TD> <TD align="center"> 9.10 </TD> <TD align="center"> 8.97 </TD> <TD align="center"> 5.67 </TD> <TD align="center"> TRUE </TD> </TR>
      <TR> <TD align="center"> 4 </TD> <TD align="center"> 3.11 </TD> <TD align="center"> Fair </TD> <TD align="center"> J </TD> <TD align="center"> I1 </TD> <TD align="center"> 65.90 </TD> <TD align="center"> 57.00 </TD> <TD align="center"> 9823 </TD> <TD align="center"> 9.15 </TD> <TD align="center"> 9.02 </TD> <TD align="center"> 5.98 </TD> <TD align="center"> TRUE </TD> </TR>
      <TR> <TD align="center"> 5 </TD> <TD align="center"> 3.01 </TD> <TD align="center"> Good </TD> <TD align="center"> H </TD> <TD align="center"> SI2 </TD> <TD align="center"> 57.60 </TD> <TD align="center"> 64.00 </TD> <TD align="center"> 18593 </TD> <TD align="center"> 9.44 </TD> <TD align="center"> 9.38 </TD> <TD align="center"> 5.42 </TD> <TD align="center"> TRUE </TD> </TR>
      <TR> <TD align="center"> 6 </TD> <TD align="center"> 1.03 </TD> <TD align="center"> Fair </TD> <TD align="center"> E </TD> <TD align="center"> I1 </TD> <TD align="center"> 78.20 </TD> <TD align="center"> 54.00 </TD> <TD align="center"> 1262 </TD> <TD align="center"> 5.72 </TD> <TD align="center"> 5.59 </TD> <TD align="center"> 4.42 </TD> <TD align="center"> TRUE </TD> </TR>
       </TABLE>
    
Ok, this looks pretty silly. It's not even a table! Well, you have to use xtable() in conjunction with the knitr chunk setting results = 'asis'. You don't want to set this globally for every chunk in your report - only for the ones that generate a table. If you don't know how to do this, I'll leave you to find it in the knitr documentation on your own because I'm using RStudio and that's kind of a game changer. Chances are, if you've gotten this far in the post, you know how to do this.
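In case it helps, a chunk with that setting looks something like this in the .Rmd source (the chunk label diamond-table is made up; the important part is the results='asis' in the header):

```{r diamond-table, results='asis'}
print(xtable(temp, align = rep("c", dim(temp)[2] + 1)), type = "html")
```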
Note: The align = rep('c', dim(temp)[2] + 1) part of the call to xtable() center-aligns the values inside the cells of the table. You might have noticed this was the default in all of my calls to pander().
    Also Note: The rownames are back, damn them. I'm not going to demonstrate getting rid of them because I suffer from severe apathy.
    print(xtable(temp, align = rep("c", dim(temp)[2] + 1)), type = "html")
    
|   | carat | cut | color | clarity | depth | table | price | x | y | z | Outlier |
|:-:|:-----:|:---:|:-----:|:-------:|:-----:|:-----:|:-----:|:----:|:----:|:----:|:-------:|
| 1 | 2.01 | Fair | I | I1 | 67.40 | 58.00 | 5696 | 7.71 | 7.64 | 5.17 | TRUE |
| 2 | 2.01 | Fair | I | I1 | 55.90 | 64.00 | 5696 | 8.48 | 8.39 | 4.71 | TRUE |
| 3 | 3.01 | Premium | I | I1 | 62.70 | 58.00 | 8040 | 9.10 | 8.97 | 5.67 | TRUE |
| 4 | 3.11 | Fair | J | I1 | 65.90 | 57.00 | 9823 | 9.15 | 9.02 | 5.98 | TRUE |
| 5 | 3.01 | Good | H | SI2 | 57.60 | 64.00 | 18593 | 9.44 | 9.38 | 5.42 | TRUE |
| 6 | 1.03 | Fair | E | I1 | 78.20 | 54.00 | 1262 | 5.72 | 5.59 | 4.42 | TRUE |
Ok, so when I supply the setting results = 'asis' to my xtable() chunk, I get a nice table in .html, but when I use pander to convert it to a .docx (which we'll get to in a minute), it absolutely butchers it. My little table now spans more than 3 pages in Word:
[screenshot: the converted table sprawling across multiple pages in Word]
So xtable() is great for .html tables, but it is often verbose to use, and it can be difficult to make even small tweaks that pander() handles pretty easily. The real dealbreaker, though, is its incompatibility with conversion to Word via Pandoc.convert().
    But I have a hunch here. What happens if I try this results='asis' thing on some of my pander() chunks?
    Note: I'm sure this is in the pander package documentation somewhere online, but I couldn't find anything super clear on it.
    panderOptions("table.style", "multiline")  # Used in conjunction with the chunk setting results='asis'
    pander(temp)
    

    carat cut color clarity depth table price x y z

    2.01 Fair I I1 67.4 58 5696 7.71 7.64 5.17
    2.01 Fair I I1 55.9 64 5696 8.48 8.39 4.71
    3.01 Premium I I1 62.7 58 8040 9.1 8.97 5.67
    3.11 Fair J I1 65.9 57 9823 9.15 9.02 5.98
    3.01 Good H SI2 57.6 64 18593 9.44 9.38 5.42

    1.03 Fair E I1 78.2 54 1262 5.72 5.59 4.42

    Table: Table continues below

    Outlier

    TRUE
    TRUE
    TRUE
    TRUE
    TRUE

    TRUE

    Not quite what I'm looking for…
    panderOptions("table.style", "simple")  # Used in conjunction with the chunk setting results='asis'
    pander(temp)
    
    carat cut color clarity depth table price x y z

    2.01 Fair I I1 67.4 58 5696 7.71 7.64 5.17 2.01 Fair I I1 55.9 64 5696 8.48 8.39 4.71 3.01 Premium I I1 62.7 58 8040 9.1 8.97 5.67 3.11 Fair J I1 65.9 57 9823 9.15 9.02 5.98 3.01 Good H SI2 57.6 64 18593 9.44 9.38 5.42 1.03 Fair E I1 78.2 54 1262 5.72 5.59 4.42
    Table: Table continues below

    Outlier

    TRUE
    TRUE
    TRUE
    TRUE
    TRUE
    TRUE
    Still no good…
    panderOptions("table.style", "grid")  # Used in conjunction with the chunk setting results='asis'
    pander(temp)
    
    +———+———+———+———–+———+———+———+——+——+——+ | carat | cut | color | clarity | depth | table | price | x | y | z | +=========+=========+=========+===========+=========+=========+=========+======+======+======+ | 2.01 | Fair | I | I1 | 67.4 | 58 | 5696 | 7.71 | 7.64 | 5.17 | +———+———+———+———–+———+———+———+——+——+——+ | 2.01 | Fair | I | I1 | 55.9 | 64 | 5696 | 8.48 | 8.39 | 4.71 | +———+———+———+———–+———+———+———+——+——+——+ | 3.01 | Premium | I | I1 | 62.7 | 58 | 8040 | 9.1 | 8.97 | 5.67 | +———+———+———+———–+———+———+———+——+——+——+ | 3.11 | Fair | J | I1 | 65.9 | 57 | 9823 | 9.15 | 9.02 | 5.98 | +———+———+———+———–+———+———+———+——+——+——+ | 3.01 | Good | H | SI2 | 57.6 | 64 | 18593 | 9.44 | 9.38 | 5.42 | +———+———+———+———–+———+———+———+——+——+——+ | 1.03 | Fair | E | I1 | 78.2 | 54 | 1262 | 5.72 | 5.59 | 4.42 | +———+———+———+———–+———+———+———+——+——+——+
    Table: Table continues below
    +———–+ | Outlier | +===========+ | TRUE | +———–+ | TRUE | +———–+ | TRUE | +———–+ | TRUE | +———–+ | TRUE | +———–+ | TRUE | +———–+
    Definitely not…
    panderOptions("table.style", "rmarkdown")  # Used in conjunction with the chunk setting results='asis'
    pander(temp)
    
| carat | cut | color | clarity | depth | table | price | x | y | z |
|:-----:|:-------:|:-----:|:-------:|:-----:|:-----:|:-----:|:----:|:----:|:----:|
| 2.01 | Fair | I | I1 | 67.4 | 58 | 5696 | 7.71 | 7.64 | 5.17 |
| 2.01 | Fair | I | I1 | 55.9 | 64 | 5696 | 8.48 | 8.39 | 4.71 |
| 3.01 | Premium | I | I1 | 62.7 | 58 | 8040 | 9.1 | 8.97 | 5.67 |
| 3.11 | Fair | J | I1 | 65.9 | 57 | 9823 | 9.15 | 9.02 | 5.98 |
| 3.01 | Good | H | SI2 | 57.6 | 64 | 18593 | 9.44 | 9.38 | 5.42 |
| 1.03 | Fair | E | I1 | 78.2 | 54 | 1262 | 5.72 | 5.59 | 4.42 |

Table: Table continues below

| Outlier |
|:-------:|
| TRUE |
| TRUE |
| TRUE |
| TRUE |
| TRUE |
| TRUE |
Bingo! This looks good in .html, but will it also look good once I convert it to Word? Interestingly, all of the pander() table styles convert nicely into a table object in Word when I convert to a .docx with results = 'asis' specified. I can manipulate these tables as such, sorting columns and so on. This is what I've been after!
Note: If your table runneth over (i.e. it is wide enough that you invoke the ugly "Table: Table continues below" message from pander()), the "below" part doesn't always get turned into a table like it ought to. The only pander() style that had trouble with this was the 'rmarkdown' style. The other 3 styles, once converted to .docx, yield a table, the message "Table continues below", and then another table object below it. As such, I recommend using panderOptions('table.split.table', Inf) so that everything gets crammed into one table no matter what. You can then resize it in Word as necessary.
    Real quick, here's the code for the conversion into a .docx:
    Pandoc.convert("Tables with knitr and pander.md", format = "docx")  # Assuming my working directory is set to where this file is saved
    
So to recap, here's what you need to do to get a good-looking report if you use the same workflow as I do (a sketch of the whole setup follows the list):
    1. Set panderOptions('table.split.table', Inf)
2. Set panderOptions('table.style', 'rmarkdown'). I like this one best because it looks good in .html and in Word. (Do 1 and 2 just once, at the beginning of your .Rmd script)
    3. Set the individual chunk option results = 'asis'
    4. Get rid of those stupid rownames
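Here's a minimal sketch of what that setup looks like in an .Rmd file. The chunk labels and the data frame temp are just placeholders, and rownames(temp) <- NULL is only one way to handle step 4 (it resets a data frame's rownames to plain 1, 2, 3, ...):

```{r setup}
# steps 1 and 2: run once, at the beginning of the .Rmd
library(pander)
panderOptions("table.split.table", Inf)
panderOptions("table.style", "rmarkdown")
```

```{r my-table, results='asis'}
# step 3 is the results='asis' in the chunk header above
rownames(temp) <- NULL  # step 4: reset the rownames
pander(temp)
```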

    Go make some tables!


  6. R vs. Perl/mySQL - an applied genomics showdown

Recently I was given an assignment for a class I'm taking that got me thinking about speed in R. This isn't something I'm usually concerned with, but the first time I tried to run my solution (using plyr's ddply()) it was going to take all night to compute. I consulted the professor who taught me to use R, and he said that if I could boil the computationally intensive portion of the program down to an apply() command I could take advantage of the vectorized nature of R and greatly improve my speed. I constructed my solution around this methodology and had great success. The assignment as it was given is as follows:

    Match Regulatory Elements in the Genome to the Closest Gene

    Genomic genius Ginny Genes has performed a gigantic genome-wide experiment to find regulatory elements in human embryonic stem cells. She wants to know what are the possible target genes of these regulatory elements. She has the locations of the regulatory elements in a bed-formatted, text, tab-delimited file with the following fields:
    1) Chromosome
    2) Start position in chromosome
    3) End position in chromosome
    4) Randomly assigned name
    Your assignment is to help Ginny by finding for each regulatory element both the closest gene upstream and the closest gene downstream of the regulatory element.
    The input file “RegulatoryElements.bed” contains the list of regulatory positions. You should open this file, read each line one at a time, search for the gene upstream and downstream closest based on the transcription start site, and print out the following information for each on a single line:
    1) Chromosome of gene
    2) Start position of gene
    3) End position of gene
    4) Name of gene (common name - name2 field)
    5) Genomic strand (+ or -)
    6) Name of corresponding regulatory element

    My solution:

First I need to read the files described above into R. This is simple enough, but for added speed I use the colClasses = argument in the read.table() function. Basically this allows me to pre-specify what class each column of the file should be read in as. This saves some time because normally R decides this for you, and if you're reading in a lot of data, that decision requires a lot of information and time. For example, if I have a file with a million rows of data, and all of the entries in column 1 are numbers except the last one, which is a character string, then R must check each entry in this column before eventually assigning it to be of class factor. This is time consuming. Also, R by default converts any column with even one string in it to a factor, which requires some conversion. In the example I just gave of 999,999 numbers and 1 string, if all 999,999 numbers were unique, I would get a factor with a million levels. This would take some time to create. Notice below that I avoid creating factor variables except in particularly simple cases where there are only a few possible levels. Reading in things like gene names as character columns instead of factors saves some time.
    # Clear memory
    rm(list = ls())
    
    # Load in the file containing Regulatory Element data
    mydat <- read.table("C:/Users/rnorberg/Desktop/Classes/BIOL 525/RegulatoryElements.bed", 
        header = F, sep = "\t", colClasses = c("factor", "numeric", "numeric", 
            "character"))
    names(mydat) <- c("chrom", "txStart", "txEnd", "name")
    
    # Load in Genes data
    genes <- read.table("C:/Users/rnorberg/Desktop/Classes/BIOL 525/wgEncodeGencodeCompV12.txt", 
        sep = "\t", header = F, colClasses = c("integer", "character", "factor", 
            "factor", rep("numeric", 4), "integer", "character", "character", "integer", 
            "character", "factor", "factor", "character"))
    names(genes) <- c("bin", "name", "chrom", "strand", "txStart", "txEnd", "cdsStart", 
        "cdsEnd", "exonCount", "exonStarts", "exonEnds", "score", "name2", "cdsStartStat", 
        "cdsEndStat", "exonFrames")
    
    These are pretty big data frames, particularly the genes data frame:
    dim(mydat)
    
    ## [1] 129061      4
    
    dim(genes)
    
    ## [1] 167536     16
    
Let's actually compare the time it takes to read the data with and without specifying my colClasses up front. This is easily done by wrapping the command you wish to time (in this case my call to read.table()) inside the function system.time(). I'll compare for just the large genes data frame.
    # Time without specifying colClasses
    system.time(genes <- read.table("C:/Users/rnorberg/Desktop/Classes/BIOL 525/wgEncodeGencodeCompV12.txt", 
        sep = "\t", header = F))
    
    ##    user  system elapsed 
    ##   27.19    0.11   27.41
    
    
    # Time with colClasses specified
    system.time(genes <- read.table("C:/Users/rnorberg/Desktop/Classes/BIOL 525/wgEncodeGencodeCompV12.txt", 
        sep = "\t", header = F, colClasses = c("integer", "character", "factor", 
            "factor", rep("numeric", 4), "integer", "character", "character", "integer", 
            "character", "factor", "factor", "character")))
    
    ##    user  system elapsed 
    ##    8.47    0.05    8.54
    
    That's a pretty big difference!
A few more words about the system.time() function: this function doesn't just return a summary of the time elapsed, it also actually executes the command wrapped inside. So after running either of the commands above I get a summary of the time elapsed in my output and the object genes now exists in my workspace.
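For example (a throwaway illustration, not part of the assignment):

st <- system.time(y <- sqrt(1:1e+06))  # y gets created AND the timing is captured in st
st["elapsed"]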
Now that I have my data loaded in, it's time to start matching regulatory elements in my data frame mydat to nearby genes in my huge data frame named genes.
Since it is my goal to eventually use an apply() function to accomplish this, and apply() functions are most efficient when each iteration only produces one number like an index or subscript, I will create an index variable in the genes data frame that I will be searching through to find the nearest gene.
# Create an ID variable for the genes data frame
    genes$ID <- 1:nrow(genes)
    
Next, let's think about what we are actually going to do. For each regulatory element in mydat (one is described in each row), I want to find the closest gene, both upstream and downstream. For a particular regulatory element, I can immediately eliminate from my search all of the genes that are on a different chromosome. I anticipate that this will save me lots of search time by narrowing down where I'm searching. But I don't want to subset my enormous genes data frame for each iteration of the solution function that I will be apply()ing to mydat. So here I call on a cool function, split(), to break genes into separate data frames (and deposit them into a list) one time, instead of subsetting genes over and over again. To get a better idea of what split() does, let's look at a simple example.
    First create an example data frame:
    ex <- data.frame(chrom = rep(1:2, each = 13), txStart = letters)
    ex
    
    ##    chrom txStart
    ## 1      1       a
    ## 2      1       b
    ## 3      1       c
    ## 4      1       d
    ## 5      1       e
    ## 6      1       f
    ## 7      1       g
    ## 8      1       h
    ## 9      1       i
    ## 10     1       j
    ## 11     1       k
    ## 12     1       l
    ## 13     1       m
    ## 14     2       n
    ## 15     2       o
    ## 16     2       p
    ## 17     2       q
    ## 18     2       r
    ## 19     2       s
    ## 20     2       t
    ## 21     2       u
    ## 22     2       v
    ## 23     2       w
    ## 24     2       x
    ## 25     2       y
    ## 26     2       z
    
    Now split() it by the column chrom:
    split(ex, ex$chrom)
    
    ## $`1`
    ##    chrom txStart
    ## 1      1       a
    ## 2      1       b
    ## 3      1       c
    ## 4      1       d
    ## 5      1       e
    ## 6      1       f
    ## 7      1       g
    ## 8      1       h
    ## 9      1       i
    ## 10     1       j
    ## 11     1       k
    ## 12     1       l
    ## 13     1       m
    ## 
    ## $`2`
    ##    chrom txStart
    ## 14     2       n
    ## 15     2       o
    ## 16     2       p
    ## 17     2       q
    ## 18     2       r
    ## 19     2       s
    ## 20     2       t
    ## 21     2       u
    ## 22     2       v
    ## 23     2       w
    ## 24     2       x
    ## 25     2       y
    ## 26     2       z
    
    This is a list of length 2. Each element in the list is a data frame. Notice the names of the list elements:
    names(split(ex, ex$chrom))
    
    ## [1] "1" "2"
    
    The list element named "1" contains all of the rows from the original data frame (ex) that had the value “1” in the chrom column.
Now let's diverge again from the programming and talk about the science of what we're trying to do. We are trying to find the gene closest to each regulatory element because that is the gene the regulatory element most likely acts on. Transcription happens directionally: it always proceeds from the 5' end to the 3' end of the DNA strand. That is left to right on the positive strand and right to left on the negative strand - but both strands are numbered left to right, even the negative strand. How do I know this?
    sum(genes$txEnd < genes$txStart)
    
    ## [1] 0
    
None of the txEnd values (where the gene ends) are less than the txStart values (where the gene begins). If the positions were recorded in transcription order, genes on the negative (-) strand would have txStart greater than txEnd, because transcription on that strand runs right to left; since that never happens, the positions must be numbered left to right on both strands.
It only makes sense that a regulatory element would act on a gene downstream of it, so on the positive strand we should look for the nearest txStart site, and on the negative strand we should look for the nearest txEnd site - on the negative strand, txEnd is actually the transcription start site. Since we need to search differently depending on which strand we are on, we can further subdivide the search task. Now I want to split() my genes data frame by chromosome and strand. How might I go about split()ing by 2 variables?
    The easiest way would be to create a new variable that is a combination of chrom and strand, then split by that. An example:
    ex$strand <- rep(c("+", "-"), 13)
    ex$new_variable <- paste(ex$chrom, ex$strand)
    ex
    
    ##    chrom txStart strand new_variable
    ## 1      1       a      +          1 +
    ## 2      1       b      -          1 -
    ## 3      1       c      +          1 +
    ## 4      1       d      -          1 -
    ## 5      1       e      +          1 +
    ## 6      1       f      -          1 -
    ## 7      1       g      +          1 +
    ## 8      1       h      -          1 -
    ## 9      1       i      +          1 +
    ## 10     1       j      -          1 -
    ## 11     1       k      +          1 +
    ## 12     1       l      -          1 -
    ## 13     1       m      +          1 +
    ## 14     2       n      -          2 -
    ## 15     2       o      +          2 +
    ## 16     2       p      -          2 -
    ## 17     2       q      +          2 +
    ## 18     2       r      -          2 -
    ## 19     2       s      +          2 +
    ## 20     2       t      -          2 -
    ## 21     2       u      +          2 +
    ## 22     2       v      -          2 -
    ## 23     2       w      +          2 +
    ## 24     2       x      -          2 -
    ## 25     2       y      +          2 +
    ## 26     2       z      -          2 -
    
    split(ex, ex$new_variable)
    
    ## $`1 -`
    ##    chrom txStart strand new_variable
    ## 2      1       b      -          1 -
    ## 4      1       d      -          1 -
    ## 6      1       f      -          1 -
    ## 8      1       h      -          1 -
    ## 10     1       j      -          1 -
    ## 12     1       l      -          1 -
    ## 
    ## $`1 +`
    ##    chrom txStart strand new_variable
    ## 1      1       a      +          1 +
    ## 3      1       c      +          1 +
    ## 5      1       e      +          1 +
    ## 7      1       g      +          1 +
    ## 9      1       i      +          1 +
    ## 11     1       k      +          1 +
    ## 13     1       m      +          1 +
    ## 
    ## $`2 -`
    ##    chrom txStart strand new_variable
    ## 14     2       n      -          2 -
    ## 16     2       p      -          2 -
    ## 18     2       r      -          2 -
    ## 20     2       t      -          2 -
    ## 22     2       v      -          2 -
    ## 24     2       x      -          2 -
    ## 26     2       z      -          2 -
    ## 
    ## $`2 +`
    ##    chrom txStart strand new_variable
    ## 15     2       o      +          2 +
    ## 17     2       q      +          2 +
    ## 19     2       s      +          2 +
    ## 21     2       u      +          2 +
    ## 23     2       w      +          2 +
    ## 25     2       y      +          2 +
    
    Notice again the names of this list:
    names(split(ex, ex$new_variable))
    
    ## [1] "1 -" "1 +" "2 -" "2 +"
    
    This does the job nicely, but I can do this slightly more directly without cluttering up my workspace:
    genes.list <- split(genes, paste(genes$chrom, genes$strand))
    
This creates a list in which each element is a data frame containing a subset of the genes data frame. The first element is all of the rows in genes that have chrom == 'chr1' and strand == '-', and it is named "chr1 -", and so on and so forth.
    Next, I create 2 variables in mydat to pick out the appropriate element of genes.list to search through.
    mydat$myvar1 <- paste(mydat$chrom, "+")
    mydat$myvar2 <- paste(mydat$chrom, "-")
    
    Now each row of mydat is associated with 2 elements of genes.list - the '+' and '-' data frames for the correct chromosome.
Now it is time to create a function to return the ID number (which is really a row number in the original genes data frame) of the closest gene on the positive strand. I'll create a separate function to find the closest gene on the negative strand.
In order to apply() this function later, I must reference my columns by index rather than with the $ operator. For example, if I wanted to apply() a function that simply printed the chrom for each row in mydat, I would have to write print(x[1]) and I could not write print(x$chrom), because inside apply() x is just a (named) vector holding one row, not a data frame, so $ doesn't work on it.
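A quick illustration of that, just poking at the first few rows of mydat:

apply(mydat[1:3, ], 1, function(x) x[[1]])  # returns the chrom of each of the first 3 rows
# apply(mydat[1:3, ], 1, function(x) x$chrom)  # would error: '$ operator is invalid for atomic vectors'

With that in mind, here's the function: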
    find.pos.indices <- function(x) {
        f1 <- x[[5]]
        f2 <- as.numeric(x[[2]])  # This is the value of the reg. element's txStart, converted to numeric (from integer)
        myindex1 <- genes.list[[f1]][which.min(abs(genes.list[[f1]]$txStart - f2)), 
            "ID"]
        # This finds the row number within the correct subsetted dataframe that
        # contains the closest gene, then returns the value of ID in that row.
    }
    
    Here lies the crux of this program. apply() functions in R take advantage of the vectorized nature of the language to do what would otherwise be done in loops in a much more efficient way. So I apply() my function over each row of the data frame named mydat and deposit the results into pos.indices, a vector of integers.
    pos.indices <- apply(mydat, 1, find.pos.indices)
    
pos.indices is just a vector of row numbers that correspond to the closest genes, so I use it to attach the variables of interest from the genes data frame to the rows of mydat. For example, if pos.indices were only the number 10, R would assign the 10th value of genes$txStart to mydat$PtxStart.
    mydat$PtxStart <- genes$txStart[pos.indices]
    mydat$PtxEnd <- genes$txEnd[pos.indices]
    mydat$Pname2 <- genes$name2[pos.indices]
    mydat$Pstrand <- genes$strand[pos.indices]
    
    Now repeat the same procedure from above, but for the negative strand.
find.neg.indices <- function(x) {
    f1 <- x[[6]]  # the 'chrom -' key into genes.list
    f2 <- as.numeric(x[[3]])  # the reg. element's txEnd, converted to numeric (from integer)
    # on the negative strand the transcription start site is txEnd, so search that column
    myindex1 <- genes.list[[f1]][which.min(abs(genes.list[[f1]]$txEnd - f2)), 
        "ID"]
}
    neg.indices <- apply(mydat, 1, find.neg.indices)
    
    mydat$NtxStart <- genes$txStart[neg.indices]
    mydat$NtxEnd <- genes$txEnd[neg.indices]
    mydat$Nname2 <- genes$name2[neg.indices]
    mydat$Nstrand <- genes$strand[neg.indices]
    
    # Get rid of the variables I created
    mydat$myvar1 <- NULL
    mydat$myvar2 <- NULL
    
Done! Now let's save the resulting data frame and see how long this entire process takes.
    # Write the dataframe to a .csv file
    write.csv(mydat, file = "< Path/Filename.csv >")
    
    If you want to clock the time of this program, save the code as a .R file (an R script), then enter the following command:
    system.time(source("<path_to_R_script>"))
    
On my computer, the entire process from reading in the data to writing the resulting data frame to a .csv file took 58.42 seconds. A friend of mine used Perl to interface with a mySQL database that already had the data loaded into it, and his program took over 700 seconds to accomplish the same task. This was the way we were supposed to complete the assignment for class, but instead of struggling with the unfamiliar Perl syntax I used my native language (R) and had great success. Many of my classmates had run times of 7 or more hours. Needless to say, that is a far cry from under a minute.
    One of the ideas of the assignment was to get us to interface with a database and I did not do this, but I was kind of proud of myself that I managed to do the entire assignment in base R with no packages.


  7. Stop Sign Sampling Project

    Post 1: Planning Phase

Welcome back to the blog, y'all. It's been a while since my last post and I've got some fun stuff for you. I'm currently enrolled in a survey sampling methodology class and we've been given a semester-long project, which I will of course be doing entirely in R. My group's assignment is to estimate the proportion of cars that actually stop at a stop sign in Chapel Hill. This will obviously require a well-thought-out survey design, some GIS tasks, and of course some survey data analysis. Here's what we've got so far:
    I'm using a bunch of packages here. This is really a great example of how awesome R is. So many people contribute to make R capable of just about anything you can imagine. Thanks to all of the package authors out there!
    # Clear memory
    rm(list = ls())
    # Load packages I will need later - This is a pretty package intensive
    # project
    require(ggplot2)
    require(ggmap)
    require(sp)
    require(rgdal)
    require(foreign)
    require(grid)
    require(lattice)
    require(maptools)
    require(splancs)
    require(plyr)
    require(gpclib)
    require(xtable)
    require(bitops)
    require(RCurl)
    gpclibPermit()  # required for fortify method
    

    Objective:

    Answer the research question “What proportion of drivers come to a proper 3 second stop at stop signs in Chapel Hill?” After interviewing Officer Kuhns of the Chapel Hill Police Department, we define a proper stop as “no tire rotation.”

    Target Population:

    All cars at all stop signs in Chapel Hill.

    Sampling Unit:

    Stop signs

    Observation Unit:

    Individual cars

    Strategy

The simplest way to approach this research question would be to obtain a directory of all stop signs within our target population and randomly select a sample.
Unfortunately, no such directory exists. To approximate such a strategy we will use systematic random sampling to select a random sample of stop signs in Chapel Hill. We will randomly generate a point on the map that lies within the boundaries of Chapel Hill and select the nearest stop sign to that point to be included in our sample. This presents a problem, however, because not all stop signs will have an equal probability of selection. Those that are not near any others will have a higher probability of selection, while those that are in regions of particularly dense "stop sign-age" will have a lower probability of selection. To remedy this, we have divided Chapel Hill into zones that we expect to have similar densities of stop signs: residential, commercial, and institutional zones. We will treat the zones as strata and sample equally from each one.
We have obtained a GIS layer from the Town of Chapel Hill that clearly defines what is inside and what is outside of the town limits, as well as 35 different "zones" as defined by the town. Find it here and download it to your computer and load it with readOGR(), or follow the procedure given below to download it into a temporary folder, unzip it, and read it in - all from within R. Now that's pretty sweet.
    # create a temporary file and a temporary directory on your local disk
    tf <- tempfile()
    td <- tempdir()
    
    # run the download file function, download the zipped file (zipped files
    # are binary) and save the result to the temporary file
    download.file("http://gis.townofchapelhill.org/GIS_Data/zoning_districts.zip", 
        tf, mode = "wb")
    
    # unzip the files to the temporary directory
    files <- unzip(tf, exdir = td)  # unzipped files are now stored in the object `files`
    
    # give the path (stored in `td`) to the argument 'dsn' and the filename
    # without any filetype to the argument 'layer'
    zones <- readOGR(dsn = td, layer = "zoning_districts", verbose = F)
    zones1 <- spTransform(zones, CRS("+proj=longlat +ellps=WGS84"))
    
Next, I'm going to use the fantastic package ggmap to get a base layer for my map to plot my polygons over. This is made extremely simple by the get_map() function in ggmap. I simply give it a location to be the center of my map and define my "zoom" (this takes a little trial and error, but the rule of thumb is that zoom=0 is the whole world, zoom=10 is about the city level, and zoom=21 is building level). There are also a number of maptypes you can choose from, including some interesting ones like "watercolor", but for this project a roadmap will suffice. A call to get_map() queries Google Maps and gets the map layer requested. Notice the Google watermark in the bottom corner of the map. get_map() must save the layer that it tracked down somewhere, and so it creates a temporary file in whatever path is specified by your working directory at the time of the call to get_map(). I've occasionally had some trouble with get_map() and I think it's due to one of two reasons:
    1) It has trouble when my working directory is set to a path with spaces somewhere in it (for example “C:// Users/rnorberg/Stop Sign Project”), or
    2) It has trouble when my working directory is set to a path in which I do not have permission to save/edit files. I'm on a computer that the school gave me and I've never really taken the time to figure out the security settings on it.
So I set my working directory to my desktop, because I know for a fact that I have permission to save in it and it has no spaces in its path.
    setwd("C://Users/rnorberg/Desktop")
    map1 <- get_map(location = "Chapel Hill, NC", zoom = 12, maptype = "roadmap")
    
    Ok, now away from the technical stuff and back to the actual project.
Our intention is to treat the zones in our zoning layer as strata, and 35 strata seems like a lot (keep in mind that strata implies you must sample within each one, and I don't want to have to sample at 35+ stop signs), so we collapse these 35 zones into 3 major categories: residential, commercial, and institutional. A map of some of these zones, with a key to elucidate what each zone is (they are named with rather uninformative abbreviations in the dataset), can be found here.
    Commercial_Districts <- c("TC-1", "TC-2", "TC-3-C", "CC", "NC", "MU-V", "MU-OI-1", 
        "MU-R-1", "IND", "CC-C", "TC-2-C", "NC-C", "TC-1-C")
    Institutional_Districts <- c("OI-1", "OI-2", "OI-3", "OI-4", "U-1", "OI-2-C", 
        "OI-1-C")
    Residential_Districts <- c("R-LD5", "RI", "R-LD1", "R-1A", "R-1", "R-2", "R-2A", 
        "R-3", "R-4", "R-5", "R-6", "R-SS-C", "RT", "R-3-C", "R-5-C", "R-4-C")
    
    zones1$District <- factor(NA, levels = c("Commercial", "Residential", "Institutional"))
    zones1$District[zones1$ZONING %in% Commercial_Districts] <- "Commercial"
    zones1$District[zones1$ZONING %in% Residential_Districts] <- "Residential"
    zones1$District[zones1$ZONING %in% Institutional_Districts] <- "Institutional"
    
Basically, what I've done here is create an empty categorical variable with 3 possible values (Commercial, Residential, and Institutional), then search through the variable zones1$ZONING: if its value is also a value in my vector Commercial_Districts, I assign the value 'Commercial' to my new variable zones1$District. Then I repeat this for the other two district vectors.
Next I plot the modified layer. Before I do so, I call the fortify() command to transform my object zones1 into an object that the graphics package ggplot2 will recognize. Currently, the object zones1 is of class SpatialPolygonsDataFrame, but ggplot2 will only plot data.frames. I call fortify(), adding the argument region='District' to preserve the value of District attached to each polygon. I store the result in a new object, zones2, which is of class data.frame.
    zones2 <- fortify(zones1, region = "District")
    
Now I can finally plot the polygons! I first plot the base layer stored in the object map1, then I add a geom_polygon on top of it, specifying that this procedure should use the zones2 data frame, then mapping certain aesthetics of the figure to certain variables in zones2 (the stuff inside the aes() argument), and finally specifying that the polygons should be semi-transparent (alpha=0.3). Inside the aes() argument I tell ggplot() that the x axis of the figure should correspond to the variable long in zones2, the y axis should correspond to the variable lat in zones2, and the fill (the color of the inside of the polygons) and the color (the color of the polygon edges) should correspond to the group variable, which was created from District when we specified region='District' in the call to fortify().
    ggmap(map1)+
      geom_polygon(data=zones2, aes(x=long, y=lat, fill=group, color=group), alpha=.3)+
      theme_bw()+ # slightly modifies the appearance of the figure, only cosmetic
      labs(x='Longitude', y='Latitude') # add my own labels, nicer than the default "long" and "lat"
    
[figure: the Chapel Hill base map with the zoning polygons overlaid]
Well, not bad, but that legend is horrific. It looks like for each individual polygon in the layer (there are many) fortify() has assigned a separate value of group (Residential.1, Residential.2, Residential.3 and so on). Let's do away with the legend then.
    ggmap(map1)+
      geom_polygon(data=zones2, aes(x=long, y=lat, fill=group, color=group), alpha=.3)+
      guides(color=F, fill=F)+ # remove the legend
      theme_bw()+
      labs(x='Longitude', y='Latitude')
    
[figure: the same map with the legend removed]
That's a little better, but I'm not totally satisfied with the default color scheme, so I'm going to assign my own colors to the polygons using scale_fill_manual() and scale_color_manual(). I must pass these a value for each unique value of group, and recall that group has many values, not just the 3 values we originally created in District, so I must give a color value for each of the groups depicted in the messy legend we just got rid of. The polygons will be plotted in alphabetical order by group, so I can use rep() to repeat the same color value one time for each residential polygon, then do the same with a different color value for all of the institutional and commercial polygons.
    # how many residential polygons are there?
    rz <- length(grep("Residential", unique(zones2$group)))
    
    # how many commercial polygons are there?
    cz <- length(grep("Commercial", unique(zones2$group)))
    
    # how many institutional polygons are there?
    iz <- length(grep("Institutional", unique(zones2$group)))
    
    ggmap(map1) + geom_polygon(data = zones2, aes(x = long, y = lat, fill = group, 
        color = group), alpha = 0.3) + scale_fill_manual(values = rep(c("red", "green", 
        "blue"), c(cz, iz, rz))) + scale_color_manual(values = rep(c("red", "green", 
        NA), c(cz, iz, rz))) + guides(color = F, fill = F) + theme_bw() + labs(x = "Longitude", 
        y = "Latitude")
    
[figure: the map with the manual red/green/blue color scheme]
    Now that's pretty snazzy. Notice some of the blue areas and red areas overlap. I'm not sure why or how to fix this, but this figure will do for now.
Again, let's return from the programming tangent and refocus on the big picture.
Now that we've got our map figured out and our 3 geographic strata defined, we want to randomly generate points (latitude and longitude) within each region. This is pretty straightforward using the spsample() function from the package sp. It uses a pretty fancy algorithm that goes something like this (a rough sketch in code follows the list):
    1) Define a rectangle using the min and max of longitude and latitude in the specified polygon
    2) Randomly generate a longitude and latitude within said rectangle
    3) Is the randomly generated point within the polygon? If yes, keep it, if no, throw it away and repeat. This continues until the number of desired points is reached.
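In code, the rejection idea looks roughly like this. This is just a sketch of the logic, not what spsample() actually does internally; px and py stand in for the vectors of a polygon's vertex coordinates, and point.in.polygon() comes from the sp package:

# steps 1-3 for a single polygon, where px and py are its vertex coordinates
sample.one.point <- function(px, py) {
    repeat {
        lon <- runif(1, min(px), max(px))  # steps 1-2: a random point in the bounding rectangle
        lat <- runif(1, min(py), max(py))
        if (point.in.polygon(lon, lat, px, py) > 0)  # step 3: keep it only if it falls inside
            return(c(long = lon, lat = lat))
    }
}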
The tricky part of this is determining whether a point lies within a polygon that likely has many irregular edges. Thankfully, all we have to do is call spsample(), specify an object of class SpatialPolygonsDataFrame, specify how many points we would like, and specify that we would like them selected randomly. I do this for the 3 disjoint subsets of zones1 specified in the variable District. Also notice that I set the random seed before I begin, so that the process is random but reproducible.
    set.seed(2913)
    randpoints1 <- spsample(subset(zones1, District == "Commercial"), 3, type = "random")  # this produces an object of class SpatialPoints
    randpoints1 <- data.frame(coordinates(randpoints1))  # extract the coordinates from randpoints1 and put into an object of class data.frame
    
    randpoints2 <- spsample(subset(zones1, District == "Institutional"), 3, type = "random")
    randpoints2 <- data.frame(coordinates(randpoints2))
    
    randpoints3 <- spsample(subset(zones1, District == "Residential"), 3, type = "random")
    randpoints3 <- data.frame(coordinates(randpoints3))
    
Now let's plot these points to see if we have succeeded. I simply repeat the commands for the previous plot, but add the geom_point() layers, one for each set of 3 points. In each of these 3 layers I specify inherit.aes=F so that the plotting software does not search the new data (randpointsX) for the variables long, lat, and group that were defined in the first aes() call. I then make a new call to aes() to specify the mappings I would like to use with this new data. The last thing I do is specify some aesthetics outside of the aes() argument. These will apply to all points in that layer, and do not map to a particular variable. For example, I would like all points in the data randpoints1 to be larger than the default size of 1, not just those with a particular x value, so I put size=4 outside of the aes() argument.
    ggmap(map1)+
      geom_polygon(data=zones2, aes(x=long, y=lat, fill=group, color=group), alpha=.3)+
      scale_fill_manual(values=rep(c('red', 'green', 'blue'), c(cz,iz,rz)))+
      scale_color_manual(values=rep(c('red', 'green', NA), c(cz,iz,rz)))+
      geom_point(data=randpoints1, inherit.aes=F, aes(x=x, y=y), color='red', pch=18, size=4)+
      geom_point(data=randpoints2, inherit.aes=F, aes(x=x, y=y), color='#006600', pch=18, size=4)+
      geom_point(data=randpoints3, inherit.aes=F, aes(x=x, y=y), color='blue', pch=18, size=4)+
      guides(color=F, fill=F)+
      theme_bw()+
      labs(x='Longitude', y='Latitude')
    
[figure: the map with the nine randomly generated points plotted over the zones]
(The color '#006600' is a particular shade of dark green that shows up much better than regular 'green'.)
    
This looks like we've been successful! Table 1, below, gives the exact coordinates of these randomly generated points.
    table1 <- rbind(randpoints1, randpoints2, randpoints3)
    names(table1) <- c("Longitude", "Latitude")
    table1$Zone <- rep(c("Commercial", "Institutional", "Residential"), each = 3)
    print(table1[, c(3, 2, 1)], rownames = F)
    
    ##            Zone Latitude Longitude
    ## 1    Commercial    35.95    -79.00
    ## 2    Commercial    35.90    -79.01
    ## 3    Commercial    35.91    -79.02
    ## 4 Institutional    35.91    -79.04
    ## 5 Institutional    35.90    -79.01
    ## 6 Institutional    35.94    -79.06
    ## 7   Residential    35.89    -79.07
    ## 8   Residential    35.94    -79.01
    ## 9   Residential    35.87    -79.08
    
We will use GPS (on our cell phones) to navigate to each one of these points and locate the nearest stop sign in a systematic fashion. We will need to agree on an exact protocol for this to make it truly systematic random sampling. The stop signs closest to these points will constitute our sample.
    Once we have selected our sampling units, we will stratify again by day of week and time of day. More on this in a future post!

8. A while ago I was asked to give a presentation at my job about using R to create statistical graphics. I had also just read some reviews of the Slidify package in R, and I thought it would be extremely appropriate to create my presentation about visualization in R, in R. So I set about breaking in the Slidify package, and I've got to give a huge shout-out to Ramnath Vaidyanathan, who created it. It was a pleasure to use, and at one point when I had a question I posted on his GitHub and he responded helpfully and promptly. What a great guy! Here's a link to the final product posted on RPubs:
    http://rpubs.com/rnorberg/3117

Check it out, it's full of gorgeous figures and formatting. It's really an accomplishment to get something this nice looking out of a command-line program. Most of the credit for the formatting beauty goes to the Slidify package. It uses markdown in much the same way as knitr does, so if you're familiar with that, using Slidify isn't a big jump.

    Some comments on the whole experience:

1. While Slidify does a nice job with formatting (notice in particular the headings, regular text, bullets, and code chunks set off with a different background color), especially for markdown, it's not perfect. For example, I had trouble gauging the appropriate amount of material for each slide. If you try to cram too much onto one slide, instead of compressing to fit all of it (as PowerPoint would), your text/figure/whatever just hangs off the bottom of the slide out of sight. It was frustrating to have to guess and check the amount of material on each slide, especially with figures and material I decided to go back and add.
    2. I loved the fact that I could add images from my hard drive, not just images generated in R. You'll notice in my slideshow that I made use of this when I took screen shots of reshapeGUI in action and included them. This was really easy to do and a great feature. I would say the same for hyperlinks. These were easy to add and extremely useful since I ended up publishing the presentation to the web. Those watching didn't have to jot down a link or a book title every time I suggested a good place to find extra info about something. I just emailed around a link to the presentation and everyone had all of the resources just a click away.
3. This brings me to the most frustrating part of the whole thing: publishing online. Now this is not an issue with Slidify, but rather with Rpubs and GitHub. I'd never published anything from R directly to either, and setting this functionality up proved extremely painful. I first tried to push the thing to GitHub because I already had a GitHub account, but I never managed to figure it out. After much frustration I tried Rpubs. By pure determination I finally stumbled my way through the setup process for that and eventually published the presentation to my new Rpubs account. I honestly don't even remember what all I had to do, but I remember this being incredibly frustrating. The documentation accompanying Slidify could be improved by adding a how-to section for those who have never uploaded to either online repository. UPDATE: The author of the package pointed out to me that you can also publish your slideshow using Dropbox. This is done simply by saving the slideshow into your Dropbox folder. I wish I'd known this, because you can't get much easier than that!
In summation, I was really impressed with Slidify, but in the end there was no reason to use it instead of a traditional PowerPoint (other than to show off that I had made my presentation about R, in R). PowerPoint (or Google Docs or some other free presentation software) would allow me the same functionality and more, without having to struggle from behind a command line. The only reason I can see to do a presentation this way is if you might need to update it frequently (say, regularly refreshing it with recent data), much the same sort of thing you might use a markdown document for. The problem with this is that as your data changes, your output and figures will change, but your text discussing them will not, which is dangerous. The same goes for markdown documents as well, though, and those get used quite frequently. So overall, I highly recommend Slidify, but you need to have the right reasons to use it. Happy presenting!

9. In class today we were discussing several types of survey sampling, and we split into groups and did a little investigation. We were given a page of 100 rectangles with varying areas and took 3 samples of size 10. Our first was a convenience sample: we just picked a group of 10 rectangles adjacent to each other and counted their area. Next, we took a simple random sample (SRS), numbering the rectangles 1 through 100 and choosing 10 with a random number generator. Last, we took a stratified random sample by marking 50 rectangles as "Large" and 50 as "Small", then randomly selecting 5 from each stratum. Our estimates of the total area in all 100 rectangles and their 95% confidence intervals are given in the plot above, along with the true value. Our experiment turned out exactly how it was supposed to. Our convenience sample had the largest variability and our stratified sample the smallest. All 3 confidence intervals captured the true value, as you would expect to happen 95% of the time. I would share my R code with the figure, but it's really sloppy and not nearly as nice as the succinct confirmation of statistical principles offered by just the image.
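For the curious, though, here is a generic sketch of the SRS estimate of the total and its 95% confidence interval - the areas below are made up, not our class data:

N <- 100  # rectangles on the page
areas <- c(4, 9, 1, 16, 6, 2, 12, 3, 8, 5)  # areas of the n = 10 sampled rectangles (made up)
n <- length(areas)
total.hat <- N * mean(areas)  # estimated total area
se <- N * sd(areas)/sqrt(n) * sqrt(1 - n/N)  # standard error with the finite population correction
total.hat + c(-1, 1) * qt(0.975, n - 1) * se  # 95% confidence interval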


10. For a class I'm taking this semester on genomics we're dealing with some pretty large data, and for this reason we're learning to use mySQL. I decided to be a geek and do the assignments in R as well, to demonstrate the ability of R to handle pretty large data sets quickly. Here's our first bit of work in mySQL, solved in R:



    BIOL 525 Lecture 3: In class work

    Download, create, and import the following tables either from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ or

    From the Downloads folder in the Resources section of the wiki:

    1) GENCODE gene table: wgEncodeGencodeCompV12.sql, wgEncodeGencodeCompV12txt.gz (or wgEncodeGencodeCompV12txt)

    2) GENCODE gene attributes table: wgEncodeGencodeAttrsV12.sql, wgEncodeGencodeAttrsV12.txt.gz (or wgEncodeGencodeAttrsV12.txt)

I decided to download the .txt files from the class website, save them in my designated folder for this class, and then read them into R using the read.table() command. This is the simplest way of reading data into R. Note the argument sep='\t' in the read.table() command. This lets R know that the file is tab-delimited and allows it to be read in correctly. After reading in the data I manually named the columns using the information given in the .sql files on the class webpage.
    attributes <- read.table("C:/Users/rnorberg/Desktop/Classes/BIOL 525/wgEncodeGencodeAttrsV12.txt", 
        sep = "\t")
    mynames <- c("geneId", "geneName", "geneType", "geneStatus", "transcriptId", 
        "transcriptName", "transcriptType", "transcriptStatus", "havanaGeneId", 
        "havanaTranscriptId", "ccdsId", "level", "transcriptClass")
    names(attributes) <- mynames
    
    genes <- read.table("C:/Users/rnorberg/Desktop/Classes/BIOL 525/wgEncodeGencodeCompV12.txt", 
        sep = "\t")
    mynames1 <- c("bin", "name", "chrom", "strand", "txStart", "txEnd", "cdsStart", 
        "cdsEnd", "exonCount", "exonStarts", "exonEnds", "score", "name2", "cdsStartStat", 
        "cdsEndStat", "exonFrames")
    names(genes) <- mynames1
    

    Answer the following questions:

    1) How many entries are in the Gencode gene table?

    dim(genes)[1]
    
    ## [1] 167536
    

    2) How many entries are in the genomic interval chr17:40,830,967-41,642,846.

    nrow(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 41642846))
    
    ## [1] 142
    

    3) How many of these entries are for genes on the negative strand.

    nrow(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 41642846 & 
        strand == "-"))
    
    ## [1] 90
    

    4) How many of these negative strand genes have more than 15 exons.

    nrow(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 41642846 & 
        strand == "-" & exonCount > 15))
    
    ## [1] 23
    

    5) How many uniquely named genes (name2 field) are on the negative strand and have more than 15 exons. List them.

    length(unique(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 
        41642846 & strand == "-" & exonCount > 15)$name2))
    
    ## [1] 3
    

    6) How many entries (transcripts) are there for BRCA1.

    nrow(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 41642846 & 
        strand == "-" & exonCount > 15 & name2 == "BRCA1"))
    
    ## [1] 15
    

    ** Bonus **

    7) The name2 field in the wgEncodeGencodeCompV7 and the geneName field in wgEncodeGencodeAttrsV7 tables are linked. For the three genes in #5, determine all possible geneStatus from the wgEncodeGencodeAttrsV7 table.

    Hint: You can ORDER BY multiple fields.

    my3genes <- unique(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 
        41642846 & strand == "-" & exonCount > 15)$name2)
    unique(subset(attributes, geneName %in% my3genes)$geneStatus)
    
    ## [1] KNOWN
    ## Levels: KNOWN NOVEL PUTATIVE
