In class today we were discussing several types of survey sampling, and we split into groups and did a little investigation. We were given a page of 100 rectangles with varying areas and took 3 samples of size 10. Our first was a convenience sample: we just picked a group of 10 rectangles adjacent to each other and counted their area. Next, we took a simple random sample (SRS), numbering the rectangles 1 through 100 and choosing 10 with a random number generator. Last, we took a stratified random sample by marking 50 rectangles as "Large" and 50 as "Small", then randomly selecting 5 from each stratum.

Our estimates of the total area in all 100 rectangles and their 95% confidence intervals are given in the plot above, along with the true value. Our experiment turned out exactly how it was supposed to: our convenience sample had the largest variability and our stratified sample the smallest, and all 3 confidence intervals captured the true value, as you would expect to happen 95% of the time. I would share my R code for the figure, but it's really sloppy and not nearly as nice as the succinct confirmation of statistical principles offered by just the image.
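For what it's worth, the estimators themselves are simple. Here is a rough sketch in R using made-up rectangle areas rather than the actual page from class (and not my sloppy original code), just to show how the SRS and stratified estimates of the total come together:

set.seed(42)

# Hypothetical areas for the 100 rectangles (the real page had fixed areas)
areas <- c(sample(1:5, 50, replace = TRUE),    # 50 "Small" rectangles
           sample(10:20, 50, replace = TRUE))  # 50 "Large" rectangles
strata <- rep(c("Small", "Large"), each = 50)

# Simple random sample of 10: estimate the total as N * sample mean
srs <- sample(areas, 10)
srs_total <- 100 * mean(srs)

# Stratified sample: 5 from each stratum, estimate each stratum's total separately
small_samp <- sample(areas[strata == "Small"], 5)
large_samp <- sample(areas[strata == "Large"], 5)
strat_total <- 50 * mean(small_samp) + 50 * mean(large_samp)

c(SRS = srs_total, Stratified = strat_total, Truth = sum(areas))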
Aug 30
Data Splitting: Time Slices With an Irregular Time Series
Purpose
The caret package includes a function for data splitting, createTimeSlices(), that creates data partitions using a fixed or growing window. The main arguments to this function, initialWindow and horizon, allow the user to create training/validation resamples consisting of contiguous observations, with the validation set always consisting of n = horizon rows. If fixedWindow = TRUE, the training set always has n = initialWindow rows.
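To get a quick feel for those arguments, a toy call might look like the following (the series here is made up for illustration, not data from the post):

library(caret)

# A toy series of 20 observations
y <- rnorm(20)

# Each training slice has 10 contiguous rows, each validation slice the next 3
slices <- createTimeSlices(y, initialWindow = 10, horizon = 3, fixedWindow = TRUE)

# First resample: train on rows 1:10, validate on rows 11:13
slices$train[[1]]
slices$test[[1]]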
Jun 5
Understanding data.table Rolling Joins
Robert Norberg
June 5, 2016
Introduction
Rolling joins in data.table are incredibly useful, but not that well documented. I wrote this to help myself figure out how to use them and perhaps it can help you too.
library(data.table)
The Setup
Imagine we have an eCommerce website that uses a third party (like PayPal) to handle payments.
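As a taste of what that setup leads to, here is a minimal rolling-join sketch; the table and column names below are placeholders of my own, not the post's:

library(data.table)

# Website purchase events and payment confirmations, each with a timestamp
website <- data.table(name = c("Ana", "Bob", "Ana"),
                      purchase_time = as.POSIXct(c("2016-06-01 10:00:00",
                                                   "2016-06-01 10:05:00",
                                                   "2016-06-01 11:00:00")))
payments <- data.table(name = c("Ana", "Bob", "Ana"),
                       payment_time = as.POSIXct(c("2016-06-01 10:00:12",
                                                   "2016-06-01 10:05:07",
                                                   "2016-06-01 11:00:33")),
                       amount = c(25, 40, 10))

# Key both tables on user and time, then roll the last key column:
# each payment is matched to the most recent earlier purchase for that user
setkey(website, name, purchase_time)
setkey(payments, name, payment_time)
website[payments, roll = TRUE]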
Apr 5
A Custom caret C5.0 Model for 2-Class Classification Problems with Class Imbalance
Robert Norberg
Monday, April 06, 2015
Introduction
In this post I share a custom model tuning procedure for optimizing the probability threshold for class imbalanced data. This is done within the excellent caret package framework and is akin to the example on the package website, but that example shows an extension of the random forest (rf) method, while I present an extension to the C5.0 method.
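The general approach mirrors the rf example on the caret site: copy the built-in C5.0 model definition with getModelInfo() and graft a probability-threshold tuning parameter onto it. The sketch below is my own shorthand for that idea, not the post's code; a complete version also has to update the grid, loop, prob, and sort components so caret knows how to tune the new parameter.

library(caret)

# Start from caret's built-in C5.0 model definition
thresh_mod <- getModelInfo("C5.0", regex = FALSE)[[1]]

# Add a probability-threshold tuning parameter alongside the usual C5.0 ones
thresh_mod$parameters <- rbind(
  thresh_mod$parameters,
  data.frame(parameter = "threshold", class = "numeric",
             label = "Probability Cutoff")
)

# Predict the first class only when its probability clears the tuned threshold
# (this stripped-down version ignores C5.0's submodel bookkeeping for trials)
thresh_mod$predict <- function(modelFit, newdata, submodels = NULL) {
  probs <- predict(modelFit, newdata, type = "prob")
  ifelse(probs[, modelFit$obsLevels[1]] >= modelFit$tuneValue$threshold,
         modelFit$obsLevels[1], modelFit$obsLevels[2])
}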
Mar 6
Getting Data From An Online Source
Robert Norberg
Hello world. It’s been a long time since I posted anything here on my blog. I’ve been busy getting my Master’s degree in statistical computing and I haven’t had much free time to blog. But I’ve been writing R code as much as ever. Now, with graduation approaching, I’m job hunting and I thought it would be good to put together a few things to show potential employers.
Jun 19
Generating Tables Using Pander, knitr, and Rmarkdown
I use a pretty common workflow (I think) for producing reports on a day-to-day basis. I write them in rmarkdown using RStudio, knit them into .html and .md documents using knitr, then convert the resulting .md file to a .docx file using pander, which is really just a way of communicating with Pandoc from my R terminal.
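The whole pipeline fits in a few lines. Here is a minimal sketch of it; report.Rmd is a placeholder file name of my own, not one from the post:

library(knitr)
library(pander)

# Knit the R Markdown source to plain markdown
knit("report.Rmd")   # writes report.md

# Hand the markdown file to Pandoc via pander to get a Word document
Pandoc.convert("report.md", format = "docx", open = FALSE)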
Mar 8
R vs. Perl/mySQL - an applied genomics showdown
Recently I was given an assignment for a class I'm taking that got me thinking about speed in R. This isn't something I'm usually concerned with, but the first time I tried to run my solution (using plyr's ddply()), it was going to take all night to compute.
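To illustrate the kind of grouped summary involved (the data below are made up, not the class assignment), the same aggregation in plyr and in data.table looks roughly like this:

library(plyr)
library(data.table)

# A hypothetical grouped summary of the sort that crawls when there are many groups
df <- data.frame(gene = sample(letters, 1e5, replace = TRUE), expr = rnorm(1e5))

# plyr: splits off one data.frame per group, so overhead grows with group count
ddply(df, .(gene), summarise, mean_expr = mean(expr))

# data.table: the same aggregation with far less overhead
dt <- as.data.table(df)
dt[, .(mean_expr = mean(expr)), by = gene]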
Feb 26
Stop Sign Project Post 1: Some GIS stuff done in R
Welcome back to the blog, y'all. It's been a while since my last post and I've got some fun stuff for you. I'm currently enrolled in a survey sampling methodology class and we've been given a semester-long project, which I will of course be doing entirely in R. My group's assignment is to estimate the proportion of cars that actually stop at a stop sign in Chapel Hill.
Feb 7
Slideshows in R
A while ago I was asked to give a presentation at my job about using R to create statistical graphics. I had also just read some reviews of the Slidify package in R and I thought it would be extremely appropriate to create my presentation about visualization in R, in R. So I set about breaking in the Slidify package and I've got to give a huge shout out to Ramnath Vaidyanathan who created this package.
Feb 4
Convenience Sample, SRS, and Stratified Random Sample Compared
In class today we were discussing several types of survey sampling and we split into groups and did a little investigation. We were given a page of 100 rectangles with varying areas and took 3 samples of size 10. Our first was a convenience sample. We just picked a group of 10 rectangles adjacent to each other and counted their area. Next, we took a simple random sample (SRS), numbering the rectangles 1 through 100 and choosing 10 with a random number generator.
Jan 22
SQL commands in R
For a class I'm taking this semester on genomics we're dealing with some pretty large data, and for this reason we're learning to use mySQL. I decided to be a geek and do the assignments in R as well, to demonstrate that R can handle pretty large data sets quickly.
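A minimal sketch of running SQL from R with DBI and RMySQL looks something like the following; the connection details and the snps table are placeholders of my own, not the course database:

library(DBI)
library(RMySQL)

# Placeholder connection details, not a real server
con <- dbConnect(MySQL(), dbname = "genomics", host = "localhost",
                 user = "student", password = "secret")

# Send SQL to the server and pull back only the summarized result
snp_counts <- dbGetQuery(con, "
  SELECT chromosome, COUNT(*) AS n_snps
  FROM snps
  GROUP BY chromosome
")

dbDisconnect(con)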