As I’m sure many statisticians do, I keep a folder of “stock code”, or template scripts that do different things. This folder is always growing and the scripts are always improving, but there are a few in there that I think are worth sharing. Some of these are templates for common analyses, while others are just reminders of how to use a couple of commands to accomplish a practical task.
This post is of the latter type. I’m going to discuss fetching data from a URL.
Why might one need to fetch data from a URL?
  • You want to share your code with someone who isn’t familiar with R and you want to avoid the inevitable explanation of how to change the file path at the beginning of the file. (“Make sure you only use forward slashes!”)
  • The data at the URL is constantly changing and you want your analysis to use the latest each time you run it.
  • You want the code to just work when it’s run from another machine with another directory tree.
  • You want to post a completely repeatable analysis on your blog and you don’t want it to begin with “go to www.blahblahblah.com, download this data, and load it into R”.
Whatever your reason may be, it’s a neat trick, but it’s not one I use so often that I can just rattle off the code for it from memory. So here’s my template. I hope it can help someone else.

Caveat!!!

This is only for data that is already in tabular form. This is not for web scraping (i.e., extracting a table of data from a Wikipedia page); there are entire packages devoted to that. This is for the simplest of all cases, where there is a .csv file or a .txt file (or similar) at a URL and you want to read it into R directly from that URL, without the intermediate step of saving it somewhere on your computer.

Using data.table’s fread()

I love the data.table package. I use it every day, for almost every project I do. It’s an extension of the data.frame object class in R that makes many improvements. One of those improvements is in the function fread(). It’s data.table’s answer to base R’s read.csv(). It does many things better, but here I’m only going to address its ability to read data right from the web. As a primer, its typical use on a data file residing on your computer would look something like this:
library(data.table)
mydat <- fread('C:/Some/File/Path.csv')
Reading data from a source on the web is no different. The example the package authors give in the help file (?fread) is this:
library(data.table)
mydat <- fread('http://www.stats.ox.ac.uk/pub/datasets/csb/ch11b.dat')
head(mydat)
   V1  V2   V3    V4 V5
1:  1 307  930 36.58  0
2:  2 307  940 36.73  0
3:  3 307  950 36.93  0
4:  4 307 1000 37.15  0
5:  5 307 1010 37.23  0
6:  6 307 1020 37.24  0
Now if you actually navigate to that link in your browser, you won’t see anything, but a download dialog should pop up. If you navigate to the parent directory of that address, http://www.stats.ox.ac.uk/pub/datasets/csb, you will see some text and, further down the page, several links to data files. Each of these links launches a download dialog when clicked. To grab the URL of the data file to pass to fread(), right-click the link and select “Copy link address”. Other data files online might appear in the browser instead of launching a download dialog, like this one a professor of mine had us use for an assignment. fread() handles these URLs just the same.
fread() makes smart decisions about how to read the data in (it detects column names, column classes, and so on), but the function also has several arguments for specifying such things explicitly, which you can use at your own discretion. I find fread('filename') almost always just works, but sometimes there are reasons to be more explicit when reading data in.
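For example, if the automatic detection ever guesses wrong, you can spell things out yourself. The column names and classes below are just assumptions I’m making about that particular .dat file for the sake of illustration; the arguments themselves are documented in ?fread.
library(data.table)
# Being explicit instead of relying on fread()'s auto-detection.
# The names and classes here are assumptions about the ch11b.dat file, not gospel.
mydat <- fread('http://www.stats.ox.ac.uk/pub/datasets/csb/ch11b.dat',
               header = FALSE,
               col.names = c('obs', 'day', 'time', 'temp', 'flag'),
               colClasses = c('integer', 'integer', 'integer', 'numeric', 'integer'))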

Using RStudio

If you’re not familiar with RStudio, you are a true R novice. If you know what it is, but don’t use it, skip ahead.
In RStudio, you can click “Tools” -> “Import Dataset” -> “From Web URL” and a dialog will pop up asking you for a URL. Paste a URL into the dialog box (let’s just use the same one as before: http://www.stats.ox.ac.uk/pub/datasets/csb/ch11b.dat) and click “OK”. A nice little window pops up and allows you to specify how the data should be read and what name the object should be given in R. When you click “Import”, the data is read in and some code appears in the console. What this interface does is download the data to a temporary file in a temporary folder and then read it in. The downloaded data file persists on your hard drive as long as your R session lasts, but disappears as soon as your R session ends.
This is handy, but if you wanted to repeat the process, you would have to click through the menu again and supply the data URL again. This isn’t exactly “repeatable” in the Stack Overflow sense of the word.
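If you want the same effect in a script you can re-run, the gist of what that menu does can be written out in a couple of lines of base R. (The separator and header settings below are my guesses for this particular .dat file, and temp_file and ch11b are just placeholder names.)
# Roughly what the RStudio dialog does: download to a temporary file, then read it.
url <- 'http://www.stats.ox.ac.uk/pub/datasets/csb/ch11b.dat'
temp_file <- tempfile(fileext = '.dat')
download.file(url, destfile = temp_file)
ch11b <- read.table(temp_file, header = FALSE)
head(ch11b)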

Using RCurl’s getURL()

The RCurl package provides bindings to the cURL library. This is a C library for web connections. The cURL library does way more than we need for this task and frankly, I don’t understand a lot of it. I saved RCurl for last because I usually try fread() first, and then if I get some sort of error, I resort to RCurl. Take for example the data set at this link: https://sakai.unc.edu/access/content/group/3d1eb92e-7848-4f55-90c3-7c72a54e7e43/public/data/bycatch.csv (also posted by a professor for an assignment of mine). If you try to fread() it, no dice. I have no idea what that error message means, but here’s how to get that data set in anyway.
library(RCurl)
myfile <- getURL('https://sakai.unc.edu/access/content/group/3d1eb92e-7848-4f55-90c3-7c72a54e7e43/public/data/bycatch.csv', ssl.verifyhost=FALSE, ssl.verifypeer=FALSE)
What are the arguments ssl.verifyhost=F and ssl.verifypeer=F doing? They tell cURL not to verify the server’s SSL certificate (the host name and the peer certificate, respectively), which trades away a little security to get past certificate errors. Beyond that, I don’t really know the details, but if I’m having trouble reading from a URL, specifying these arguments and changing one or both to FALSE almost always circumvents whatever error I’m getting.
This grabs the content residing at the specified URL, but doesn’t return a data.frame object. It has simply put the URL’s content into a string.
class(myfile)
[1] "character"
So how to get this into a data.frame object? We’ll use textConnection() to open a “connection” with the string, much like you would open a connection with a file on your hard drive in order to read it. Then we’ll use read.csv() (or read.table() or similar; fread() can read the string directly, no connection needed) to read the string object like a text file and create a data.frame object.
mydat <- read.csv(textConnection(myfile), header=T)
head(mydat)
   Season  Area Gear.Type  Time Tows Bycatch
1 1989-90 North    Bottom   Day   48       0
2 1989-90 North    Bottom Night    6       0
3 1989-90 North Mid-Water Night    1       0
4 1989-90 South    Bottom   Day  139       0
5 1989-90 South Mid-Water   Day    6       0
6 1989-90 South    Bottom Night    6       0
And there you have it. The data from the URL is now in a data.frame and ready to go.
Aside: read.csv() is just a version of read.table() with argument defaults such as sep = "," that make sense for reading .csv files.
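To make that aside concrete, these two calls read myfile the same way (modulo a few other defaults, like comment.char, that rarely matter for a well-behaved .csv):
# read.csv() is read.table() with csv-friendly defaults filled in
a <- read.csv(textConnection(myfile))
b <- read.table(textConnection(myfile), sep = ',', header = TRUE)
all.equal(a, b)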

A Use Case

Let’s pretend I want to automate something having to do with weather in Chicago. Maybe it’s a knitr document that I have scheduled to re-knit every night on my server. Every time the script re-runs, it should somehow take into account recent weather in Chicago. Weather Underground offers historic (and an hour ago counts as “historic”) hourly weather data for many different locations. Many of these locations are airports, which, for obvious reasons, have several meteorological sensors on site. On the Weather Underground page you can select a location and a date and see hourly weather for that calendar day. At the bottom of the page, you can click “Comma Delimited File” to see the data in comma-delimited format, perfect for reading into R.
I see that the four-letter airport code for Chicago is “KMDW” and, after clicking through a few of these URLs, I see that only the date part of the URL changes. So if I know the date, I can construct the URL where the hourly Chicago airport weather for that date can be found in .csv format.
First, I define the beginning and end of the URL, which never change.
baseURL <- 'http://www.wunderground.com/history/airport/KMDW'
suffixURL <- 'DailyHistory.html?HideSpecis=1&format=1'
There is opportunity here to generalize this for many locations if one simply maps the four-letter codes to other locations of interest using switch() or similar.
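A minimal sketch of that idea, with a made-up helper name and only one code filled in:
# Hypothetical helper mapping a friendly location name to its four-letter airport code.
# Only Chicago is filled in here; add whatever locations you care about.
airport_code <- function(location) {
  switch(location,
         chicago = 'KMDW',
         stop('No airport code on file for: ', location))
}
# Equivalent to the hard-coded baseURL above
baseURL <- paste0('http://www.wunderground.com/history/airport/', airport_code('chicago'))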
Then I ask the system for today’s date and from it produce a string in year/month/day format.
Date <- Sys.Date()
datestring <- format(Date, '%Y/%m/%d')
Then I piece all of these strings together to get a URL which will lead to a .csv file of today’s weather in Chicago.
url2fetch <- paste(baseURL, datestring, suffixURL, sep='/')
Finally I grab the content of the webpage at that URL using the RCurl method described above. I choose getURL() instead of fread() for good reason; I’ll need to do some find-and-replace to clean up some html artifacts in the data and that is more efficient to do on one big string rather than on a bunch of individual values in a data.frame.
url_content <- getURL(url2fetch)
Now I have the content of the page in a string and I want to read that string into a data.frame object, but every line of the data ends with an html newline (“<br />”) and a text newline (“\n”). read.csv() will recognize the “\n” as a signal to start a new row of the data.frame, but the “<br />” isn’t recognized and will be appended to the value in the last column of every row. So let’s take care of this before read.csv() ever gets involved. I’ll do a simple find-and-replace where I find “<br />” and replace it with an empty string (""), aka nothing. This is the regex way of find-and-delete.
url_content <- gsub('<br />', '', url_content)
Finally I can “read” the data into a data.frame object with the help of read.csv() and textConnection().
weather_data <- read.csv(textConnection(url_content))
head(weather_data)
   TimeCST TemperatureF Dew.PointF Humidity Sea.Level.PressureIn
1 12:22 AM         21.9       17.1       82                30.02
2 12:53 AM         21.9       16.0       78                30.07
3  1:53 AM         21.9       15.1       75                30.09
4  2:24 AM         21.0       14.0       74                30.04
5  2:39 AM         21.0       14.0       74                30.04
6  2:53 AM         21.0       15.1       78                30.09
  VisibilityMPH Wind.Direction Wind.SpeedMPH Gust.SpeedMPH PrecipitationIn
1           1.0            NNE          13.8             -            0.01
2           1.0            NNE          15.0             -            0.01
3           4.0            NNE          11.5             -            0.00
4           2.5            NNE          16.1             -            0.00
5           1.5            NNE          12.7             -            0.00
6           1.8            NNE          12.7             -            0.00
  Events Conditions WindDirDegrees             DateUTC
1   Snow       Snow             30 2015-02-26 06:22:00
2   Snow Light Snow             30 2015-02-26 06:53:00
3   Snow Light Snow             30 2015-02-26 07:53:00
4   Snow Light Snow             30 2015-02-26 08:24:00
5   Snow Light Snow             30 2015-02-26 08:39:00
6   Snow Light Snow             30 2015-02-26 08:53:00
Glorious.
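Since the whole point was to automate this nightly, one last touch: wrap the steps above in a small function so the scheduled knitr document only has to make a single call. fetch_kmdw_weather() is just a name I made up; the body is exactly the code from above.
library(RCurl)
# Everything above, bundled; defaults to today's weather at KMDW.
fetch_kmdw_weather <- function(date = Sys.Date()) {
  baseURL   <- 'http://www.wunderground.com/history/airport/KMDW'
  suffixURL <- 'DailyHistory.html?HideSpecis=1&format=1'
  url2fetch <- paste(baseURL, format(date, '%Y/%m/%d'), suffixURL, sep = '/')
  url_content <- getURL(url2fetch)
  url_content <- gsub('<br />', '', url_content)  # strip the html line breaks
  read.csv(textConnection(url_content))
}
weather_data <- fetch_kmdw_weather()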