Generating Tables Using Pander, knitr, and Rmarkdown

I use a pretty common workflow (I think) for producing reports on a day to day basis. I write them in rmarkdown using RStudio, knit them into .html and .md documents using knitr, then convert the resulting .md file to a .docx file using pander, which is really just a way of communicating with Pandoc via my R terminal. This workflow is great for many reasons that I won't get into, but one major shortcoming is how difficult it is to get a nice looking table out of it. I was working on a report today in which I am cleaning some data for my boss. I'll quickly replicate this for effect.

I first plot the data, then notice several outliers that we should trace back to the database and make sure that the values reported are indeed correct and not a typo etc.
rm(list = ls())
library(pander)
## Attaching package: 'pander'
## The following object is masked from 'package:knitr':
## 
## pandoc
library(ggplot2)
mydat <- diamonds

my.plot <- ggplot(mydat, aes(x = cut(carat, 0:6), y = log(price))) + facet_grid(. ~ 
    color, labeller = label_both) + geom_boxplot(fill = "grey", lwd = 0.9)
print(my.plot)
plot of chunk ChunkSettings
I see what might be a couple outliers here. (This is actually a cannonical and therefore very clean dataset, but just bear with me.)
mydat$Outlier <- F
mydat$Outlier[mydat$color == "J" & mydat$carat > 3 & mydat$carat < 4 & mydat$price < 
    exp(9.5)] <- T
mydat$Outlier[mydat$color == "I" & mydat$carat > 3 & mydat$carat < 4 & mydat$price < 
    exp(9.1)] <- T
mydat$Outlier[mydat$color == "I" & mydat$carat > 2 & mydat$carat < 3 & mydat$price < 
    exp(8.7)] <- T
mydat$Outlier[mydat$color == "H" & mydat$carat > 3 & mydat$carat < 4 & mydat$price > 
    exp(9.5)] <- T
mydat$Outlier[mydat$color == "E" & mydat$carat > 1 & mydat$carat < 2 & mydat$price < 
    exp(7.5)] <- T

print(my.plot + geom_point(data = subset(mydat, Outlier == T), color = "red", 
    size = 4))
plot of chunk unnamed-chunk-1
I might want to send these points back to the lab or whoever's in charge of data collection so that they can verify that they're good observations and not typos etc. This is where I'd like to put a good table into my report, with details about these perceived “outliers” so that these points can be checked. It would be easy enough to track this stuff down myself and put it in an email or something, but for argument's sake, I want to automate this report so that every time I get an updated database I can spit one of these guys out without any actual work.
I could simply do this:
subset(mydat, Outlier == T)
##       carat     cut color clarity depth table price    x    y    z Outlier
## 13992  2.01    Fair     I      I1  67.4    58  5696 7.71 7.64 5.17    TRUE
## 13993  2.01    Fair     I      I1  55.9    64  5696 8.48 8.39 4.71    TRUE
## 19340  3.01 Premium     I      I1  62.7    58  8040 9.10 8.97 5.67    TRUE
## 21759  3.11    Fair     J      I1  65.9    57  9823 9.15 9.02 5.98    TRUE
## 27650  3.01    Good     H     SI2  57.6    64 18593 9.44 9.38 5.42    TRUE
## 41919  1.03    Fair     E      I1  78.2    54  1262 5.72 5.59 4.42    TRUE
But that's ugly. I especially hate the row names and the hashes. The hashes are good when you're writing something you expect to be copy and pasted into an R terminal, so that whoever's doing the copy and pasting gets the code copied in and evaluated, but the results don't get evaluated because they are commented out. In my reports I'm actually hiding all of the code and I'd like my results to look a little neater.
To get rid of the rownames - slightly more verbose, but also slightly neater:
print(subset(mydat, Outlier == T), row.names = F)
##  carat     cut color clarity depth table price    x    y    z Outlier
##   2.01    Fair     I      I1  67.4    58  5696 7.71 7.64 5.17    TRUE
##   2.01    Fair     I      I1  55.9    64  5696 8.48 8.39 4.71    TRUE
##   3.01 Premium     I      I1  62.7    58  8040 9.10 8.97 5.67    TRUE
##   3.11    Fair     J      I1  65.9    57  9823 9.15 9.02 5.98    TRUE
##   3.01    Good     H     SI2  57.6    64 18593 9.44 9.38 5.42    TRUE
##   1.03    Fair     E      I1  78.2    54  1262 5.72 5.59 4.42    TRUE
And to get rid of the hashes:
opts_chunk$set(comment = NA)  # This makes the change globally for every chunk after this one
print(subset(mydat, Outlier == T), row.names = F)
 carat     cut color clarity depth table price    x    y    z Outlier
  2.01    Fair     I      I1  67.4    58  5696 7.71 7.64 5.17    TRUE
  2.01    Fair     I      I1  55.9    64  5696 8.48 8.39 4.71    TRUE
  3.01 Premium     I      I1  62.7    58  8040 9.10 8.97 5.67    TRUE
  3.11    Fair     J      I1  65.9    57  9823 9.15 9.02 5.98    TRUE
  3.01    Good     H     SI2  57.6    64 18593 9.44 9.38 5.42    TRUE
  1.03    Fair     E      I1  78.2    54  1262 5.72 5.59 4.42    TRUE
This looks better certainly, but still doesn't look like a table. I can get a little fancier bay calling the pander() function in the pander package, which applies formatting to my table for me:
pander(subset(mydat, Outlier == T))

-----------------------------------------------------
  &nbsp;     carat    cut    color   clarity   depth 
----------- ------- ------- ------- --------- -------
 **13992**   2.01    Fair      I       I1      67.4  

 **13993**   2.01    Fair      I       I1      55.9  

 **19340**   3.01   Premium    I       I1      62.7  

 **21759**   3.11    Fair      J       I1      65.9  

 **27650**   3.01    Good      H       SI2     57.6  

 **41919**   1.03    Fair      E       I1      78.2  
-----------------------------------------------------

Table: Table continues below


----------------------------------------------------
  &nbsp;     table   price   x    y    z    Outlier 
----------- ------- ------- ---- ---- ---- ---------
 **13992**    58     5696   7.71 7.64 5.17   TRUE   

 **13993**    64     5696   8.48 8.39 4.71   TRUE   

 **19340**    58     8040   9.1  8.97 5.67   TRUE   

 **21759**    57     9823   9.15 9.02 5.98   TRUE   

 **27650**    64     18593  9.44 9.38 5.42   TRUE   

 **41919**    54     1262   5.72 5.59 4.42   TRUE   
----------------------------------------------------
But the rownames are back. Ugh. Frustratingly, this is the best solution I've found to this so far:
temp <- subset(mydat, Outlier == T)
rownames(temp) <- NULL
pander(temp)

---------------------------------------------------------
 carat    cut    color   clarity   depth   table   price 
------- ------- ------- --------- ------- ------- -------
 2.01    Fair      I       I1      67.4     58     5696  

 2.01    Fair      I       I1      55.9     64     5696  

 3.01   Premium    I       I1      62.7     58     8040  

 3.11    Fair      J       I1      65.9     57     9823  

 3.01    Good      H       SI2     57.6     64     18593 

 1.03    Fair      E       I1      78.2     54     1262  
---------------------------------------------------------

Table: Table continues below


------------------------
 x    y    z    Outlier 
---- ---- ---- ---------
7.71 7.64 5.17   TRUE   

8.48 8.39 4.71   TRUE   

9.1  8.97 5.67   TRUE   

9.15 9.02 5.98   TRUE   

9.44 9.38 5.42   TRUE   

5.72 5.59 4.42   TRUE   
------------------------
I also don't really care for the way the table breaks off and then continues underneath. To change this, I must specify the option table.split.table in the pander package. According to the pander documentation
This option tells pander where to split too wide tables. The default value (80) suggests the conventional number of characters used in a line, feel free to change (e.g. to Inf to disable this feature) if you are not using a VT100 terminal any more :)
Don't mind if I do.
panderOptions("table.split.table", Inf) 
pander(temp)

----------------------------------------------------------------------------------
 carat    cut    color   clarity   depth   table   price   x    y    z    Outlier 
------- ------- ------- --------- ------- ------- ------- ---- ---- ---- ---------
 2.01    Fair      I       I1      67.4     58     5696   7.71 7.64 5.17   TRUE   

 2.01    Fair      I       I1      55.9     64     5696   8.48 8.39 4.71   TRUE   

 3.01   Premium    I       I1      62.7     58     8040   9.1  8.97 5.67   TRUE   

 3.11    Fair      J       I1      65.9     57     9823   9.15 9.02 5.98   TRUE   

 3.01    Good      H       SI2     57.6     64     18593  9.44 9.38 5.42   TRUE   

 1.03    Fair      E       I1      78.2     54     1262   5.72 5.59 4.42   TRUE   
----------------------------------------------------------------------------------
If I set table.split.table to Inf however, when I convert the .md file to a .docx file the table will run over onto the next line in Word and it will look awful like the one below.
alt text
So I set the table.split.table to 105 instead, which seems to be about as much as Word can handle on one line given the font and so forth that I'm using.
panderOptions("table.split.table", 105)
Getting there, but can we do better? The pander() function has a number of options for displaying tables. You can set the table style within the call to pander() (which is just a wrapper for pandoc.table()), but I prefer to set these options globally so that all of my tables will look the same when I generate the report:
panderOptions("table.style", "multiline")  # The default
pander(temp)

------------------------------------------------------------------------
 carat    cut    color   clarity   depth   table   price   x    y    z  
------- ------- ------- --------- ------- ------- ------- ---- ---- ----
 2.01    Fair      I       I1      67.4     58     5696   7.71 7.64 5.17

 2.01    Fair      I       I1      55.9     64     5696   8.48 8.39 4.71

 3.01   Premium    I       I1      62.7     58     8040   9.1  8.97 5.67

 3.11    Fair      J       I1      65.9     57     9823   9.15 9.02 5.98

 3.01    Good      H       SI2     57.6     64     18593  9.44 9.38 5.42

 1.03    Fair      E       I1      78.2     54     1262   5.72 5.59 4.42
------------------------------------------------------------------------

Table: Table continues below


---------
 Outlier 
---------
  TRUE   

  TRUE   

  TRUE   

  TRUE   

  TRUE   

  TRUE   
---------
panderOptions("table.style", "simple")
pander(temp)


 carat    cut    color   clarity   depth   table   price   x    y    z  
------- ------- ------- --------- ------- ------- ------- ---- ---- ----
 2.01    Fair      I       I1      67.4     58     5696   7.71 7.64 5.17
 2.01    Fair      I       I1      55.9     64     5696   8.48 8.39 4.71
 3.01   Premium    I       I1      62.7     58     8040   9.1  8.97 5.67
 3.11    Fair      J       I1      65.9     57     9823   9.15 9.02 5.98
 3.01    Good      H       SI2     57.6     64     18593  9.44 9.38 5.42
 1.03    Fair      E       I1      78.2     54     1262   5.72 5.59 4.42

Table: Table continues below



 Outlier 
---------
  TRUE   
  TRUE   
  TRUE   
  TRUE   
  TRUE   
  TRUE   
panderOptions("table.style", "grid")
pander(temp)


+---------+---------+---------+-----------+---------+---------+---------+------+------+------+
|  carat  |   cut   |  color  |  clarity  |  depth  |  table  |  price  |  x   |  y   |  z   |
+=========+=========+=========+===========+=========+=========+=========+======+======+======+
|  2.01   |  Fair   |    I    |    I1     |  67.4   |   58    |  5696   | 7.71 | 7.64 | 5.17 |
+---------+---------+---------+-----------+---------+---------+---------+------+------+------+
|  2.01   |  Fair   |    I    |    I1     |  55.9   |   64    |  5696   | 8.48 | 8.39 | 4.71 |
+---------+---------+---------+-----------+---------+---------+---------+------+------+------+
|  3.01   | Premium |    I    |    I1     |  62.7   |   58    |  8040   | 9.1  | 8.97 | 5.67 |
+---------+---------+---------+-----------+---------+---------+---------+------+------+------+
|  3.11   |  Fair   |    J    |    I1     |  65.9   |   57    |  9823   | 9.15 | 9.02 | 5.98 |
+---------+---------+---------+-----------+---------+---------+---------+------+------+------+
|  3.01   |  Good   |    H    |    SI2    |  57.6   |   64    |  18593  | 9.44 | 9.38 | 5.42 |
+---------+---------+---------+-----------+---------+---------+---------+------+------+------+
|  1.03   |  Fair   |    E    |    I1     |  78.2   |   54    |  1262   | 5.72 | 5.59 | 4.42 |
+---------+---------+---------+-----------+---------+---------+---------+------+------+------+

Table: Table continues below



+-----------+
|  Outlier  |
+===========+
|   TRUE    |
+-----------+
|   TRUE    |
+-----------+
|   TRUE    |
+-----------+
|   TRUE    |
+-----------+
|   TRUE    |
+-----------+
|   TRUE    |
+-----------+
panderOptions("table.style", "rmarkdown")
pander(temp)


|  carat  |   cut   |  color  |  clarity  |  depth  |  table  |  price  |  x   |  y   |  z   |
|:-------:|:-------:|:-------:|:---------:|:-------:|:-------:|:-------:|:----:|:----:|:----:|
|  2.01   |  Fair   |    I    |    I1     |  67.4   |   58    |  5696   | 7.71 | 7.64 | 5.17 |
|  2.01   |  Fair   |    I    |    I1     |  55.9   |   64    |  5696   | 8.48 | 8.39 | 4.71 |
|  3.01   | Premium |    I    |    I1     |  62.7   |   58    |  8040   | 9.1  | 8.97 | 5.67 |
|  3.11   |  Fair   |    J    |    I1     |  65.9   |   57    |  9823   | 9.15 | 9.02 | 5.98 |
|  3.01   |  Good   |    H    |    SI2    |  57.6   |   64    |  18593  | 9.44 | 9.38 | 5.42 |
|  1.03   |  Fair   |    E    |    I1     |  78.2   |   54    |  1262   | 5.72 | 5.59 | 4.42 |

Table: Table continues below



|  Outlier  |
|:---------:|
|   TRUE    |
|   TRUE    |
|   TRUE    |
|   TRUE    |
|   TRUE    |
|   TRUE    |
Now perhaps you've heard of the package xtable. This is a nice package for making .html tables in your documents. I've used this before and I'm just copy and pasting a couple lines from a report I wrote a few months ago:
library(xtable)
print(xtable(temp, align = rep("c", dim(temp)[2] + 1)), type = "html")
<!-- html table generated in R 3.0.1 by xtable 1.7-1 package -->
<!-- Wed Jun 19 10:40:22 2013 -->
<TABLE border=1>
<TR> <TH>  </TH> <TH> carat </TH> <TH> cut </TH> <TH> color </TH> <TH> clarity </TH> <TH> depth </TH> <TH> table </TH> <TH> price </TH> <TH> x </TH> <TH> y </TH> <TH> z </TH> <TH> Outlier </TH>  </TR>
  <TR> <TD align="center"> 1 </TD> <TD align="center"> 2.01 </TD> <TD align="center"> Fair </TD> <TD align="center"> I </TD> <TD align="center"> I1 </TD> <TD align="center"> 67.40 </TD> <TD align="center"> 58.00 </TD> <TD align="center"> 5696 </TD> <TD align="center"> 7.71 </TD> <TD align="center"> 7.64 </TD> <TD align="center"> 5.17 </TD> <TD align="center"> TRUE </TD> </TR>
  <TR> <TD align="center"> 2 </TD> <TD align="center"> 2.01 </TD> <TD align="center"> Fair </TD> <TD align="center"> I </TD> <TD align="center"> I1 </TD> <TD align="center"> 55.90 </TD> <TD align="center"> 64.00 </TD> <TD align="center"> 5696 </TD> <TD align="center"> 8.48 </TD> <TD align="center"> 8.39 </TD> <TD align="center"> 4.71 </TD> <TD align="center"> TRUE </TD> </TR>
  <TR> <TD align="center"> 3 </TD> <TD align="center"> 3.01 </TD> <TD align="center"> Premium </TD> <TD align="center"> I </TD> <TD align="center"> I1 </TD> <TD align="center"> 62.70 </TD> <TD align="center"> 58.00 </TD> <TD align="center"> 8040 </TD> <TD align="center"> 9.10 </TD> <TD align="center"> 8.97 </TD> <TD align="center"> 5.67 </TD> <TD align="center"> TRUE </TD> </TR>
  <TR> <TD align="center"> 4 </TD> <TD align="center"> 3.11 </TD> <TD align="center"> Fair </TD> <TD align="center"> J </TD> <TD align="center"> I1 </TD> <TD align="center"> 65.90 </TD> <TD align="center"> 57.00 </TD> <TD align="center"> 9823 </TD> <TD align="center"> 9.15 </TD> <TD align="center"> 9.02 </TD> <TD align="center"> 5.98 </TD> <TD align="center"> TRUE </TD> </TR>
  <TR> <TD align="center"> 5 </TD> <TD align="center"> 3.01 </TD> <TD align="center"> Good </TD> <TD align="center"> H </TD> <TD align="center"> SI2 </TD> <TD align="center"> 57.60 </TD> <TD align="center"> 64.00 </TD> <TD align="center"> 18593 </TD> <TD align="center"> 9.44 </TD> <TD align="center"> 9.38 </TD> <TD align="center"> 5.42 </TD> <TD align="center"> TRUE </TD> </TR>
  <TR> <TD align="center"> 6 </TD> <TD align="center"> 1.03 </TD> <TD align="center"> Fair </TD> <TD align="center"> E </TD> <TD align="center"> I1 </TD> <TD align="center"> 78.20 </TD> <TD align="center"> 54.00 </TD> <TD align="center"> 1262 </TD> <TD align="center"> 5.72 </TD> <TD align="center"> 5.59 </TD> <TD align="center"> 4.42 </TD> <TD align="center"> TRUE </TD> </TR>
   </TABLE>
Ok this looks pretty silly. It's not even a table! Well, you have to use xtable() in conjucntion with the knitr chunk setting results = 'asis'. You don't want to do this globally to all chunks in your report - only to the ones that generate a table. If you don't know how to do this, I'll leave you to find it in the knitr documentation on your own bacause I'm using RStudio and that's kind of a game changer. Chances are if you got this far in the post, you know how to do this.
Note: The align=rep('c',dim(temp)[2]+1)) part of the call to xtable() is to center align my values inside the cells of the table. You might have noticed this was the default in all of my calls to pander().
Also Note: The rownames are back, damn them. I'm not going to demonstrate getting rid of them because I suffer from severe apathy.
print(xtable(temp, align = rep("c", dim(temp)[2] + 1)), type = "html")
caratcutcolorclaritydepthtablepricexyzOutlier
12.01FairII167.4058.0056967.717.645.17TRUE
22.01FairII155.9064.0056968.488.394.71TRUE
33.01PremiumII162.7058.0080409.108.975.67TRUE
43.11FairJI165.9057.0098239.159.025.98TRUE
53.01GoodHSI257.6064.00185939.449.385.42TRUE
61.03FairEI178.2054.0012625.725.594.42TRUE
Ok, so when I supply the setting results = 'asis' to my xtable() chunk, I get a nice table in .html, but when I use pander to convert it to a .docx (which we'll get to in a minute), it absolutely butchers it. My little table now spans more than 3 pages in word:
alt text
So xtable() is great for .html tables, but it is often verbose to use it and it can be difficult to make even small tweaks which pander() handles pretty easily. The dealbreaker is really its incompatibility with conversion to word via Pandoc.convert() though.
But I have a hunch here. What happens if I try this results='asis' thing on some of my pander() chunks?
Note: I'm sure this is in the pander package documentation somewhere online, but I couldn't find anything super clear on it.
panderOptions("table.style", "multiline")  # Used in conjunction with the chunk setting results='asis'
pander(temp)

carat cut color clarity depth table price x y z

2.01 Fair I I1 67.4 58 5696 7.71 7.64 5.17
2.01 Fair I I1 55.9 64 5696 8.48 8.39 4.71
3.01 Premium I I1 62.7 58 8040 9.1 8.97 5.67
3.11 Fair J I1 65.9 57 9823 9.15 9.02 5.98
3.01 Good H SI2 57.6 64 18593 9.44 9.38 5.42

1.03 Fair E I1 78.2 54 1262 5.72 5.59 4.42

Table: Table continues below

Outlier

TRUE
TRUE
TRUE
TRUE
TRUE

TRUE

Not quite what I'm looking for…
panderOptions("table.style", "simple")  # Used in conjunction with the chunk setting results='asis'
pander(temp)
carat cut color clarity depth table price x y z

2.01 Fair I I1 67.4 58 5696 7.71 7.64 5.17 2.01 Fair I I1 55.9 64 5696 8.48 8.39 4.71 3.01 Premium I I1 62.7 58 8040 9.1 8.97 5.67 3.11 Fair J I1 65.9 57 9823 9.15 9.02 5.98 3.01 Good H SI2 57.6 64 18593 9.44 9.38 5.42 1.03 Fair E I1 78.2 54 1262 5.72 5.59 4.42
Table: Table continues below

Outlier

TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
Still no good…
panderOptions("table.style", "grid")  # Used in conjunction with the chunk setting results='asis'
pander(temp)
+———+———+———+———–+———+———+———+——+——+——+ | carat | cut | color | clarity | depth | table | price | x | y | z | +=========+=========+=========+===========+=========+=========+=========+======+======+======+ | 2.01 | Fair | I | I1 | 67.4 | 58 | 5696 | 7.71 | 7.64 | 5.17 | +———+———+———+———–+———+———+———+——+——+——+ | 2.01 | Fair | I | I1 | 55.9 | 64 | 5696 | 8.48 | 8.39 | 4.71 | +———+———+———+———–+———+———+———+——+——+——+ | 3.01 | Premium | I | I1 | 62.7 | 58 | 8040 | 9.1 | 8.97 | 5.67 | +———+———+———+———–+———+———+———+——+——+——+ | 3.11 | Fair | J | I1 | 65.9 | 57 | 9823 | 9.15 | 9.02 | 5.98 | +———+———+———+———–+———+———+———+——+——+——+ | 3.01 | Good | H | SI2 | 57.6 | 64 | 18593 | 9.44 | 9.38 | 5.42 | +———+———+———+———–+———+———+———+——+——+——+ | 1.03 | Fair | E | I1 | 78.2 | 54 | 1262 | 5.72 | 5.59 | 4.42 | +———+———+———+———–+———+———+———+——+——+——+
Table: Table continues below
+———–+ | Outlier | +===========+ | TRUE | +———–+ | TRUE | +———–+ | TRUE | +———–+ | TRUE | +———–+ | TRUE | +———–+ | TRUE | +———–+
Definitely not…
panderOptions("table.style", "rmarkdown")  # Used in conjunction with the chunk setting results='asis'
pander(temp)
caratcutcolorclaritydepthtablepricexyz
2.01FairII167.45856967.717.645.17
2.01FairII155.96456968.488.394.71
3.01PremiumII162.75880409.18.975.67
3.11FairJI165.95798239.159.025.98
3.01GoodHSI257.664185939.449.385.42
1.03FairEI178.25412625.725.594.42
Table: Table continues below
Outlier
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
Bingo! This looks good it .html, but will it also look good once I convert it to Word? Interestingly, all of the pander() style tables convert nicely into a table object in Word when I convert into a .docx when I specify results = 'asis'. I can manipulate these tables as such, sorting columns etc. This is what I've been after!
Note: If your table runneth over (i.e. it is wide enough that you invoke the ugly “Table: Table continues below” message from pander()), the “below” part doesn't always get turned into a table like it ought to. The only pander() style that had trouble with this was the 'rmarkdown' style. The other 3 styles, once converted to .docx yield a table, the message “Table continues below”, and then another table object below. As such, I recommend usingpanderOptions('table.split.table', Inf) so that everything gets crammed into one table no matter what. You can then resize in Word as necessary.
Real quick, here's the code for the conversion into a .docx:
Pandoc.convert("Tables with knitr and pander.md", format = "docx")  # Assuming my working directory is set to where this file is saved
So to recap, here's what you need to do to get a good looking report if you use the same workflow as I do:
  1. Set panderOptions('table.split.table', Inf)
  2. Set panderOptions('table.style', 'rmarkdown') I like this one best because it looks good in .html and in Word. (Do 1 and 2 just once, at the beggining of your .Rmd script)
  3. Set the individual chunk option results = 'asis'
  4. Get rid of those stupid rownames

Go make some tables!

2

View comments

Purpose

The caret package includes a function for data splitting, createTimeSlices(), that creates data partitions using a fixed or growing window. The main arguments to this function, initialWindow and horizon, allow the user to create training/validation resamples consisting of contiguous observations with the validation set always consisting of n = horizon rows. If fixedWindow = TRUE, the training set always has n =initialWindow rows.

Understanding data.table Rolling Joins

Robert Norberg

June 5, 2016

Introduction

Rolling joins in data.table are incredibly useful, but not that well documented. I wrote this to help myself figure out how to use them and perhaps it can help you too.

library(data.table)

The Setup

Imagine we have an eCommerce website that uses a third party (like PayPal) to handle payments.
2

A Custom caret C5.0 Model for 2-Class Classification Problems with Class Imbalance

Robert Norberg

Monday, April 06, 2015

Introduction

In this post I share a custom model tuning procedure for optimizing the probability threshold for class imbalanced data. This is done within the excellent caret package framework and is akin to the example on the package website, but the example shows an extension of therandom forest (or rf) method while I present an extension to the C5.0 method.
3

Getting Data From One Online Source

Robert Norberg

Hello world. It’s been a long time since I posted anything here on my blog. I’ve been busy getting my Masters degree in statistical computing and I haven’t had much free time to blog. But I’ve writing R code as much as ever. Now, with graduation approaching, I’m job hunting and I thought it would be good to put together a few things to show potential employers.
2

Generating Tables Using Pander, knitr, and Rmarkdown

I use a pretty common workflow (I think) for producing reports on a day to day basis. I write them in rmarkdown using RStudio, knit them into .html and .md documents using knitr, then convert the resulting .md file to a .docx file using pander, which is really just a way of communicating with Pandoc via my R terminal.
2

R vs. Perl/mySQL - an applied genomics showdown

Recently I was given an assignment for a class I'm taking that got me thinking about speed in R. This isn't something I'm usually concerned with, but the first time I tried to run my solution (ussing plyr's ddply() it was going to take all night to compute.

Stop Sign Sampling Project

Post 1: Planning Phase

Welcome back to the blog y'all. It's been a while since my last post and I've got some fun stuff for you. I'm currently enrooled in a survey sampling methodology class and we've been given a semester-long project, which I will of course be doing entirely in R. My group's assignment is to estimate the proportion of cars that actually stop at a stop sign in Chapel Hill.
1

A while ago I was asked to give a presentation at my job about using R to create statistical graphics. I had also just read some reviews of the Slidify package in R and I thought it would be extremely appropriate to create my presentation about visualization in R, in R. So I set about breaking in the Slidify package and I've got to give a huge shout out to Ramnath Vaidyanathan who created this package.

In class today we were discussing several types of survey sampling and we split into groups and did a little investigation. We were given a page of 100 rectangles with varying areas and took 3 samples of size 10. Our first was a convenience sample. We just picked a group of 10 rectangles adjacent to each other and counted their area. Next, we took a simple random sample (SRS), numbering the rectangles 1 through 100 and choosing 10 with a random number generator.

For a class I'm taking this semester on genomics we're dealing with some pretty large data and for this reason we're learning to use mySQL. I decided to be a geek and do the assignments in R as well to demonstrate the ability of R to handle pretty large data sets quickly.
My Blog List
My Blog List
Blog Archive
About Me
About Me
Loading
Dynamic Views theme. Powered by Blogger. Report Abuse.