For a genomics class I'm taking this semester we're dealing with some pretty large data, so we're learning to use MySQL. I decided to be a geek and do the assignments in R as well, to show that R can handle pretty large data sets quickly. Here's our first bit of work in MySQL, solved in R:



BIOL 525 Lecture 3: In class work

Download, create, and import the following tables, either from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ or from the Downloads folder in the Resources section of the wiki:

1) GENCODE gene table: wgEncodeGencodeCompV12.sql, wgEncodeGencodeCompV12.txt.gz (or wgEncodeGencodeCompV12.txt)

2) GENCODE gene attributes table: wgEncodeGencodeAttrsV12.sql, wgEncodeGencodeAttrsV12.txt.gz (or wgEncodeGencodeAttrsV12.txt)

I downloaded the .txt files from the class website, saved them in my folder for this class, and read them into R with the read.table() command, which is the simplest way of reading data into R. Note the argument sep='\t' in the read.table() call. It tells R that the file is tab-delimited so the columns are split correctly. The files have no header row, so after reading in the data I named the columns manually, using the column definitions given in the .sql files on the class webpage.
# Gene attributes table: tab-delimited, no header row
attributes <- read.table("C:/Users/rnorberg/Desktop/Classes/BIOL 525/wgEncodeGencodeAttrsV12.txt", 
    sep = "\t")
# Column names taken from wgEncodeGencodeAttrsV12.sql
mynames <- c("geneId", "geneName", "geneType", "geneStatus", "transcriptId", 
    "transcriptName", "transcriptType", "transcriptStatus", "havanaGeneId", 
    "havanaTranscriptId", "ccdsId", "level", "transcriptClass")
names(attributes) <- mynames

# Gene table: tab-delimited, no header row
genes <- read.table("C:/Users/rnorberg/Desktop/Classes/BIOL 525/wgEncodeGencodeCompV12.txt", 
    sep = "\t")
# Column names taken from wgEncodeGencodeCompV12.sql
mynames1 <- c("bin", "name", "chrom", "strand", "txStart", "txEnd", "cdsStart", 
    "cdsEnd", "exonCount", "exonStarts", "exonEnds", "score", "name2", "cdsStartStat", 
    "cdsEndStat", "exonFrames")
names(genes) <- mynames1
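
As an aside, the read-then-rename pattern above can probably be collapsed into a single call, because read.table() accepts a col.names argument. The sketch below runs on a tiny made-up file rather than the real download: toy_file, toy_names, and every value in the file are hypothetical. Setting quote = "" and comment.char = "" is my own assumption, a precaution in case annotation text contains stray quotes or # characters.

```r
# Toy stand-in for a downloaded table: three tab-separated rows, no header.
# All values are made up for illustration.
toy_file <- tempfile(fileext = ".txt")
writeLines(c(
  "585\ttx1\tchr1\t+\t100\t500",
  "586\ttx2\tchr17\t-\t200\t900",
  "587\ttx3\tchr17\t-\t300\t700"
), toy_file)

# Hypothetical column names (a subset of the real gene table's columns)
toy_names <- c("bin", "name", "chrom", "strand", "txStart", "txEnd")

# Read and name the columns in one step
toy <- read.table(toy_file, sep = "\t", col.names = toy_names,
                  quote = "", comment.char = "", stringsAsFactors = FALSE)
str(toy)
```

Passing stringsAsFactors = FALSE keeps text columns as plain character vectors, which changes how results print compared with the factor output shown later in this post.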

Answer the following questions:

1) How many entries are in the Gencode gene table?

nrow(genes)
## [1] 167536

2) How many entries are in the genomic interval chr17:40,830,967-41,642,846?

nrow(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 41642846))
## [1] 142
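
The filter above counts only transcripts that lie entirely inside the interval. Whether "in the interval" should also include transcripts that merely overlap it is a matter of interpretation, so here is a minimal sketch of both readings on made-up data. The toy data frame and the bounds lo and hi are hypothetical, not values from the GENCODE table.

```r
# Toy transcripts with made-up coordinates (not the GENCODE table)
toy <- data.frame(
  name    = c("a", "b", "c", "d"),
  chrom   = c("chr17", "chr17", "chr17", "chr1"),
  txStart = c(100, 50, 400, 120),
  txEnd   = c(200, 150, 600, 180),
  stringsAsFactors = FALSE
)
lo <- 100  # hypothetical interval start
hi <- 500  # hypothetical interval end

# Fully contained in [lo, hi]: the rule used in this post
contained <- subset(toy, chrom == "chr17" & txStart >= lo & txEnd <= hi)

# Any overlap with [lo, hi]: a looser alternative reading
overlapping <- subset(toy, chrom == "chr17" & txStart <= hi & txEnd >= lo)

nrow(contained)
nrow(overlapping)
```

On the toy data only transcript "a" is fully contained, while "a", "b", and "c" all overlap the interval, so the two readings can give quite different counts.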

3) How many of these entries are for genes on the negative strand?

nrow(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 41642846 & 
    strand == "-"))
## [1] 90

4) How many of these negative strand genes have more than 15 exons?

nrow(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 41642846 & 
    strand == "-" & exonCount > 15))
## [1] 23

5) How many uniquely named genes (name2 field) are on the negative strand and have more than 15 exons? List them.

length(unique(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 
    41642846 & strand == "-" & exonCount > 15)$name2))
## [1] 3
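
The question also asks for the list of genes, which is simply the unique() result without the length() wrapper. A minimal sketch on made-up data follows. The toy data frame and the gene names in it are hypothetical, and the filter is stored once in neg_big (a name I made up) so it does not need to be retyped for each question.

```r
# Toy stand-in for the genes table (made-up rows, not GENCODE data)
toy <- data.frame(
  name2     = c("GENE_A", "GENE_A", "GENE_B", "GENE_C", "GENE_B"),
  strand    = c("-", "-", "-", "+", "-"),
  exonCount = c(20, 22, 16, 30, 5),
  stringsAsFactors = FALSE
)

# Apply the filter once and reuse the result
neg_big <- subset(toy, strand == "-" & exonCount > 15)

gene_names <- unique(neg_big$name2)
gene_names          # the list of gene names
length(gene_names)  # how many there are
```

With the real data, the same idea would mean saving the chr17 subset once and building the later answers from it.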

6) How many entries (transcripts) are there for BRCA1?

nrow(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 41642846 & 
    strand == "-" & exonCount > 15 & name2 == "BRCA1"))
## [1] 15
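
Note that the call above keeps the interval, strand, and exonCount > 15 conditions, so it counts the BRCA1 transcripts that pass all of those filters. Dropping the extra conditions would count every BRCA1 row in the table. The sketch below shows the difference on made-up data using table(), which counts rows per gene in one step. The toy data frame and gene names are hypothetical.

```r
# Toy stand-in for the genes table (made-up rows, not GENCODE data)
toy <- data.frame(
  name2     = c("GENE_A", "GENE_A", "GENE_A", "GENE_B"),
  exonCount = c(24, 10, 18, 16),
  stringsAsFactors = FALSE
)

# Transcripts per gene, using every row
all_counts <- table(toy$name2)

# Transcripts per gene, restricted to those with more than 15 exons
big_counts <- table(subset(toy, exonCount > 15)$name2)

all_counts[["GENE_A"]]
big_counts[["GENE_A"]]
```

Here GENE_A has three transcripts overall but only two with more than 15 exons, so the choice of filter changes the answer.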

** Bonus **

7) The name2 field in the wgEncodeGencodeCompV7 and the geneName field in wgEncodeGencodeAttrsV7 tables are linked. For the three genes in #5, determine all possible geneStatus from the wgEncodeGencodeAttrsV7 table.

Hint: You can ORDER BY multiple fields.

my3genes <- unique(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 
    41642846 & strand == "-" & exonCount > 15)$name2)
unique(subset(attributes, geneName %in% my3genes)$geneStatus)
## [1] KNOWN
## Levels: KNOWN NOVEL PUTATIVE
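
In the output above, the Levels line lists every level of the geneStatus factor in the whole table; only the first line (KNOWN) is the actual result. The hint about ORDER BY on multiple fields has a rough R counterpart in order(), which accepts several sort keys. Below is a minimal sketch on made-up data that lists each gene with its statuses, sorted by gene and then by status. toy_genes, toy_attrs, and their contents are hypothetical.

```r
# Toy stand-ins for the two tables (made-up rows, not GENCODE data)
toy_genes <- data.frame(
  name2     = c("G1", "G1", "G2"),
  exonCount = c(20, 18, 16),
  stringsAsFactors = FALSE
)
toy_attrs <- data.frame(
  geneName   = c("G2", "G1", "G1", "G3", "G1"),
  geneStatus = c("KNOWN", "NOVEL", "KNOWN", "PUTATIVE", "KNOWN"),
  stringsAsFactors = FALSE
)

# Genes of interest, linked through name2 / geneName
keep <- unique(toy_genes$name2)

# Distinct (gene, status) pairs for those genes
pairs <- unique(subset(toy_attrs, geneName %in% keep,
                       select = c(geneName, geneStatus)))

# Sort by gene, then by status (the R analogue of ORDER BY two fields)
pairs <- pairs[order(pairs$geneName, pairs$geneStatus), ]
pairs
```

Keeping the gene name next to the status shows which gene each status belongs to, which unique() on the status column alone does not.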
