Stop Sign Sampling Project

Post 1: Planning Phase

Welcome back to the blog, y'all. It's been a while since my last post and I've got some fun stuff for you. I'm currently enrolled in a survey sampling methodology class and we've been given a semester-long project, which I will of course be doing entirely in R. My group's assignment is to estimate the proportion of cars that actually stop at a stop sign in Chapel Hill. This will obviously require a well-thought-out survey design, some GIS tasks, and of course some survey data analysis. Here's what we've got so far:
I'm using a bunch of packages here. This is really a great example of how awesome R is. So many people contribute to make R capable of just about anything you can imagine. Thanks to all of the package authors out there!
# Clear memory
rm(list = ls())
# Load packages I will need later - This is a pretty package intensive
# project
require(ggplot2)
require(ggmap)
require(sp)
require(rgdal)
require(foreign)
require(grid)
require(lattice)
require(maptools)
require(splancs)
require(plyr)
require(gpclib)
require(xtable)
require(bitops)
require(RCurl)
gpclibPermit()  # required for fortify method

Objective:

Answer the research question “What proportion of drivers come to a proper 3 second stop at stop signs in Chapel Hill?” After interviewing Officer Kuhns of the Chapel Hill Police Department, we define a proper stop as “no tire rotation.”

Target Population:

All cars at all stop signs in Chapel Hill.

Sampling Unit:

Stop signs

Observation Unit:

Individual cars

Strategy:

The simplest way to approach this research question would be to obtain a directory of all stop signs within our target population and randomly select a sample.
Unfortunately, no such directory exists. To approximate such a strategy we will use systematic random sampling to select a random sample of stop signs in Chapel Hill. We will randomly generate a point on the map that lies within the boundaries of Chapel Hill and select the stop sign nearest to that point for inclusion in our sample. This presents a problem, however, because not all stop signs will have an equal probability of selection. Those that are not near any others will have a higher probability of selection, while those that are in regions of particularly dense “stop sign-age” will have a lower probability of selection. To remedy this, we have divided Chapel Hill into zones that we expect to have similar densities of stop signs: residential, commercial, and institutional zones. We will treat the zones as strata and sample equally from each one.
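To convince ourselves the unequal-probability issue is real, here is a tiny simulation (purely illustrative, not part of our actual design) showing that when you pick the stop sign nearest to a random point, an isolated sign gets chosen far more often than any single sign sitting in a dense cluster. The coordinates below are made up.
# Hypothetical toy example: one isolated "sign" on the left, three clustered on the right
set.seed(1)
signs <- data.frame(x = c(0.10, 0.90, 0.92, 0.95), y = c(0.50, 0.50, 0.55, 0.45))

# Drop many random points in the unit square and record the nearest sign to each
pts <- data.frame(x = runif(10000), y = runif(10000))
nearest <- sapply(seq_len(nrow(pts)), function(i) {
    which.min((signs$x - pts$x[i])^2 + (signs$y - pts$y[i])^2)
})

# Selection frequencies: the isolated sign (sign 1) dominates the clustered ones
table(nearest)/length(nearest)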
We have obtained a GIS layer from the Town of Chapel Hill that clearly defines what is inside and what is outside of the town limits, as well as 35 different “zones” as defined by the Town of Chapel Hill. Find it here and download it to your computer, then load it using readOGR(), or follow the procedure given below to download it into a temporary folder, unzip it, and read it in - all from within R. Now that's pretty sweet.
# create a temporary file and a temporary directory on your local disk
tf <- tempfile()
td <- tempdir()

# run the download file function, download the zipped file (zipped files
# are binary) and save the result to the temporary file
download.file("http://gis.townofchapelhill.org/GIS_Data/zoning_districts.zip", 
    tf, mode = "wb")

# unzip the files to the temporary directory
files <- unzip(tf, exdir = td)  # unzipped files are now stored in the object `files`

# give the path (stored in `td`) to the argument 'dsn' and the filename
# without any filetype to the argument 'layer'
zones <- readOGR(dsn = td, layer = "zoning_districts", verbose = F)
zones1 <- spTransform(zones, CRS("+proj=longlat +ellps=WGS84"))
Next, I'm going to use the fantastic package ggmap to get a base layer for my map to plot my polygons over. This is made extremely simple by the get_map() function in ggmap. I simply give it a location to be the center of my map and define my “zoom” (this takes a little trial and error, but the rule of thumb is that zoom=0 is the whole world, zoom=10 is about the city level, and zoom=21 is building level). There are also a number of maptypes you can choose from, including some interesting ones like “watercolor”, but for this project a roadmap will suffice. A call to get_map() queries Google Maps and gets the map layer requested. Notice the Google watermark in the bottom corner of the map. get_map() must save the layer that it tracked down somewhere, so it creates a temporary file in whatever path is specified by your working directory at the time of the call to get_map(). I've occasionally had some trouble with get_map() and I think it's due to one of two reasons:
1) It has trouble when my working directory is set to a path with spaces somewhere in it (for example “C://Users/rnorberg/Stop Sign Project”), or
2) It has trouble when my working directory is set to a path in which I do not have permission to save/edit files. I'm on a computer that the school gave me and I've never really taken the time to figure out the security settings on it.
So I set my working directory to be my desktop, because I know for a fact I have permission to save there and it has no spaces in its path.
setwd("C://Users/rnorberg/Desktop")
map1 <- get_map(location = "Chapel Hill, NC", zoom = 12, maptype = "roadmap")
Ok, now away from the technical stuff and back to the actual project.
Our intention is to treat the zones in our zoning layer as strata, and 35 strata seems like a lot (keep in mind that strata imply you must sample within each one, and I don't want to have to sample at 35+ stop signs), so we collapse these 35 zones into 3 major categories: residential, commercial, and institutional. A map of some of these zones with a key to elucidate what each zone is (they are named with rather uninformative abbreviations in the dataset) can be found here.
Commercial_Districts <- c("TC-1", "TC-2", "TC-3-C", "CC", "NC", "MU-V", "MU-OI-1", 
    "MU-R-1", "IND", "CC-C", "TC-2-C", "NC-C", "TC-1-C")
Institutional_Districts <- c("OI-1", "OI-2", "OI-3", "OI-4", "U-1", "OI-2-C", 
    "OI-1-C")
Residential_Districts <- c("R-LD5", "RI", "R-LD1", "R-1A", "R-1", "R-2", "R-2A", 
    "R-3", "R-4", "R-5", "R-6", "R-SS-C", "RT", "R-3-C", "R-5-C", "R-4-C")

zones1$District <- factor(NA, levels = c("Commercial", "Residential", "Institutional"))
zones1$District[zones1$ZONING %in% Commercial_Districts] <- "Commercial"
zones1$District[zones1$ZONING %in% Residential_Districts] <- "Residential"
zones1$District[zones1$ZONING %in% Institutional_Districts] <- "Institutional"
Basically, what I've done here is create an empty categorical variable with 3 possible values (Commercial, Residential, and Institutional), then search through the variable zones1$ZONING; if its value is also a value in my vector Commercial_Districts, I assign the value 'Commercial' to my new variable zones1$District. Then I repeat this for my other two district vectors.
Next I plot the modified layer. Before I do so, I call fortify() to transform my object zones1 into an object that the graphics package ggplot2 will recognize. Currently, the object zones1 is of class SpatialPolygonsDataFrame, but ggplot2 will only plot data.frames. I call fortify(), adding the argument region='District' to preserve the value of District attached to each polygon. I store the resulting object in a new object, zones2, which is of class data.frame.
zones2 <- fortify(zones1, region = "District")
Now I can finally plot the polygons! I first plot the base layer stored in the object map1, then I add a geom_polygon on top of it, specifying that this layer should use the zones2 data frame, mapping certain aesthetics of the figure to certain variables in zones2 (the stuff inside the aes() argument), and finally specifying that the polygons should be semi-transparent (alpha=0.3). Inside the aes() argument I tell ggplot() that the x axis of the figure should correspond to the variable long in zones2, the y axis should correspond to the variable lat in zones2, and the fill (the color of the inside of the polygons) and the color (the color of the polygon edges) should correspond to the group variable, which was created from District when we specified region='District' in the call to fortify().
ggmap(map1)+
  geom_polygon(data=zones2, aes(x=long, y=lat, fill=group, color=group), alpha=.3)+
  theme_bw()+ # slightly modifies the appearance of the figure, only cosmetic
  labs(x='Longitude', y='Latitude') # add my own labels, nicer than the default "long" and "lat"
[Figure: zoning polygons overlaid on the Chapel Hill base map, with the default legend]
Well, not bad, but that legend is horrific. It looks like for each individual polygon in the layer (there are many) fortify() has assigned a separate value of group (Residential.1, Residential.2, Residential.3, and so on). Let's do away with the legend then.
ggmap(map1)+
  geom_polygon(data=zones2, aes(x=long, y=lat, fill=group, color=group), alpha=.3)+
  guides(color=F, fill=F)+ # remove the legend
  theme_bw()+
  labs(x='Longitude', y='Latitude')
[Figure: the same map with the legend removed]
That's a little better, but I'm not totally satisfied with the default color scheme, so I'm going to assign my own colors to the polygons using scale_fill_manual() and scale_color_manual(). I must pass these a value for each unique value of group, and recall that group has many values, not just the 3 values we originally created in District, so I must give a color value for each of the groups depicted in the messy legend we just got rid of. The polygons are plotted in alphabetical order by group, so I can use rep() to repeat the same color value once for each commercial polygon, then do the same with different color values for all of the institutional and residential polygons.
# how many residential polygons are there?
rz <- length(grep("Residential", unique(zones2$group)))

# how many commercial polygons are there?
cz <- length(grep("Commercial", unique(zones2$group)))

# how many institutional polygons are there?
iz <- length(grep("Institutional", unique(zones2$group)))

ggmap(map1) + geom_polygon(data = zones2, aes(x = long, y = lat, fill = group, 
    color = group), alpha = 0.3) + scale_fill_manual(values = rep(c("red", "green", 
    "blue"), c(cz, iz, rz))) + scale_color_manual(values = rep(c("red", "green", 
    NA), c(cz, iz, rz))) + guides(color = F, fill = F) + theme_bw() + labs(x = "Longitude", 
    y = "Latitude")
[Figure: the map with manual colors - commercial zones in red, institutional in green, residential in blue]
Now that's pretty snazzy. Notice some of the blue areas and red areas overlap. I'm not sure why or how to fix this, but this figure will do for now.
Again, let's return from the programming tangent and refocus on the big picture.
Now that we've got our map figured out and our 3 geographic strata defined, we want to randomly generate points (latitude and longitude) within each region. This is pretty straightforward using the spsample() function from the package sp. It uses a pretty fancy algorithm that goes something like this (a minimal sketch of the idea follows the list below):
1) Define a rectangle using the min and max of longitude and latitude in the specified polygon
2) Randomly generate a longitude and latitude within said rectangle
3) Is the randomly generated point within the polygon? If yes, keep it; if no, throw it away and repeat. This continues until the desired number of points is reached.
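To make that concrete, here's a minimal sketch of the accept/reject loop described above. This is not the actual implementation inside spsample(); the names sample_in_polygon, poly_x, and poly_y are hypothetical placeholders (poly_x and poly_y would hold a polygon's vertex coordinates), and the inside-the-polygon test leans on point.in.polygon() from the sp package we already loaded.
# Illustrative only - a toy version of the rejection-sampling idea, not spsample() itself
sample_in_polygon <- function(poly_x, poly_y, n) {
    kept <- data.frame(x = numeric(0), y = numeric(0))
    while (nrow(kept) < n) {
        # 1) draw a candidate uniformly from the bounding rectangle
        cand_x <- runif(1, min(poly_x), max(poly_x))
        cand_y <- runif(1, min(poly_y), max(poly_y))
        # 2) keep the candidate only if it falls inside the polygon
        if (sp::point.in.polygon(cand_x, cand_y, poly_x, poly_y) > 0) {
            kept <- rbind(kept, data.frame(x = cand_x, y = cand_y))
        }
    }
    kept
}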
The tricky part of this is determining whether a point lies within a polygon that likely has many irregular edges. Thankfully, all we have to do is call spsample(), specify an object of class SpatialPolygonsDataFrame, specify how many points we would like, and specify that we would like them selected randomly. I do this for the 3 disjoint subsets of zones1 specified in the variable District. Also notice that I set the random seed before I begin, so that this process is random but reproducible.
set.seed(2913)
randpoints1 <- spsample(subset(zones1, District == "Commercial"), 3, type = "random")  # this produces an object of class SpatialPoints
randpoints1 <- data.frame(coordinates(randpoints1))  # extract the coordinates from randpoints1 and put into an object of class data.frame

randpoints2 <- spsample(subset(zones1, District == "Institutional"), 3, type = "random")
randpoints2 <- data.frame(coordinates(randpoints2))

randpoints3 <- spsample(subset(zones1, District == "Residential"), 3, type = "random")
randpoints3 <- data.frame(coordinates(randpoints3))
Now let's plot these points to see if we have succeeded. I simply repeat the commands for the previous plot, but add the geom_point() layers, one for each set of 3 points. In each of these 3 layers I specify inherit.aes=F so that the plotting software does not search for variables named long, lat, and group in the new data (randpointsX) that were defined in the first aes() call. I then make a new call to aes() to specify the mappings I would like to use with this new data. The last thing I do is specify some aesthetics outside of the aes() argument. These will apply to all points in that layer and do not map to a particular variable. For example, I would like all points in the data randpoints1 to be larger than the default size of 1, not just those with a particular x value, so I put size=4 outside of the aes() argument.
ggmap(map1)+
  geom_polygon(data=zones2, aes(x=long, y=lat, fill=group, color=group), alpha=.3)+
  scale_fill_manual(values=rep(c('red', 'green', 'blue'), c(cz,iz,rz)))+
  scale_color_manual(values=rep(c('red', 'green', NA), c(cz,iz,rz)))+
  geom_point(data=randpoints1, inherit.aes=F, aes(x=x, y=y), color='red', pch=18, size=4)+
  geom_point(data=randpoints2, inherit.aes=F, aes(x=x, y=y), color='#006600', pch=18, size=4)+
  geom_point(data=randpoints3, inherit.aes=F, aes(x=x, y=y), color='blue', pch=18, size=4)+
  guides(color=F, fill=F)+
  theme_bw()+
  labs(x='Longitude', y='Latitude')
[Figure: the stratum map with the nine randomly generated points marked]
(The color '#006600' used above is a particular shade of dark green that shows up much better on the map than regular 'green'.)
This looks like we've been successful! Table 1, below, gives the exact coordinates of these randomly generated points.
table1 <- rbind(randpoints1, randpoints2, randpoints3)
names(table1) <- c("Longitude", "Latitude")
table1$Zone <- rep(c("Commercial", "Institutional", "Residential"), each = 3)
print(table1[, c(3, 2, 1)], row.names = FALSE)
##          Zone Latitude Longitude
##    Commercial    35.95    -79.00
##    Commercial    35.90    -79.01
##    Commercial    35.91    -79.02
## Institutional    35.91    -79.04
## Institutional    35.90    -79.01
## Institutional    35.94    -79.06
##   Residential    35.89    -79.07
##   Residential    35.94    -79.01
##   Residential    35.87    -79.08
We will use GPS (on our cell phones) to navigate to each of these points and locate the nearest stop sign in a systematic fashion. We will need to agree on an exact protocol for this to make it truly systematic random sampling. The stop sign closest to each point will constitute our sample.
Once we have selected our sampling units, we will stratify again by day of week and time of day. More on this in a future post!