For a class I'm taking
this semester on genomics we're dealing with some pretty large data and for
this reason we're learning to use mySQL. I decided to be a geek and do the
assignments in R as well to demonstrate the ability of R to handle pretty large
data sets quickly. Here's our first bit of work in mySQL, solved in R:
BIOL 525 Lecture 3: In class work
Download, create, and import the following tables either from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ or
From the Downloads folder in the Resources section of the wiki:
1) GENCODE gene table: wgEncodeGencodeCompV12.sql, wgEncodeGencodeCompV12txt.gz (or wgEncodeGencodeCompV12txt)
2) GENCODE gene attributes table: wgEncodeGencodeAttrsV12.sql, wgEncodeGencodeAttrsV12.txt.gz (or wgEncodeGencodeAttrsV12.txt)
I decided to download the .txt files from the class website, save them in my designated folder for this class and then infile them to R using the
read.table()
command. This is the simplest way of reading data into R. Note the argument sep='\t'
in the read.table()
command. This lets R know that the file is a tab-delimited file and allows it to be read in correctly. After reading in the data I manually named the columns using the information given in the .sql files on the class webpage.attributes <- read.table("C:/Users/rnorberg/Desktop/Classes/BIOL 525/wgEncodeGencodeAttrsV12.txt",
sep = "\t")
mynames <- c("geneId", "geneName", "geneType", "geneStatus", "transcriptId",
"transcriptName", "transcriptType", "transcriptStatus", "havanaGeneId",
"havanaTranscriptId", "ccdsId", "level", "transcriptClass")
names(attributes) <- mynames
genes <- read.table("C:/Users/rnorberg/Desktop/Classes/BIOL 525/wgEncodeGencodeCompV12.txt",
sep = "\t")
mynames1 <- c("bin", "name", "chrom", "strand", "txStart", "txEnd", "cdsStart",
"cdsEnd", "exonCount", "exonStarts", "exonEnds", "score", "name2", "cdsStartStat",
"cdsEndStat", "exonFrames")
names(genes) <- mynames1
Answer the following questions:
1) How many entries are in the Gencode gene table?
dim(genes)[1]
## [1] 167536
2) How many entries are in the genomic interval chr17:40,830,967-41,642,846.
nrow(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 41642846))
## [1] 142
3) How many of these entries are for genes on the negative strand.
nrow(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 41642846 &
strand == "-"))
## [1] 90
4) How many of these negative strand genes have more than 15 exons.
nrow(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 41642846 &
strand == "-" & exonCount > 15))
## [1] 23
5) How many uniquely named genes (name2 field) are on the negative strand and have more than 15 exons. List them.
length(unique(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <=
41642846 & strand == "-" & exonCount > 15)$name2))
## [1] 3
6) How many entries (transcripts) are there for BRCA1.
nrow(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <= 41642846 &
strand == "-" & exonCount > 15 & name2 == "BRCA1"))
## [1] 15
** Bonus **
7) The name2 field in the wgEncodeGencodeCompV7 and the geneName field in wgEncodeGencodeAttrsV7 tables are linked. For the three genes in #5, determine all possible geneStatus from the wgEncodeGencodeAttrsV7 table.
Hint: You can ORDER BY multiple fields.
my3genes <- unique(subset(genes, chrom == "chr17" & txStart >= 40830967 & txEnd <=
41642846 & strand == "-" & exonCount > 15)$name2)
unique(subset(attributes, geneName %in% my3genes)$geneStatus)
## [1] KNOWN
## Levels: KNOWN NOVEL PUTATIVE
Add a comment