King's College London - COST SYSGENET - Introduction to R 2013

Bioconductor - a use case - Tools to handle genomic data

Dr Matthew Davis, Enright laboratory

European Molecular Biology Laboratory, European Bioinformatics Institute


In this practical we will be using packages from Bioconductor. Bioconductor provides a repository for packages designed to handle large quantities of genomic data. The packages are used within R. The Bioconductor website can be found here: Bioconductor currently holds a huge number of packages along with documentation and reference manuals and is definitely worth exploring.

Accessing genomic annotation data using biomaRt

When performing genomewide analysis there are many ways to access annotation data. The biomaRt package strikes a balance between ease of use and flexibility. It also allows you to access databases and gather genomic information even if R is your only programming language.

First load the biomaRt library.


Next display all the web services accessible via biomaRt.


We will use the Ensembl database today.

database <- useMart("ensembl")

Make a list of the datasets available, we will use the mouse and append this information to our 'Mart' object.

database <- useDataset("mmusculus_gene_ensembl",mart=database)

The getBM function allows the collection of data from the database. We supply this function with the Mart object we are interested in and 3 other options.

Attributes are the pieces of information we are interested in collecting from the database.


Filters specify any criteria by which we wish to restrict our results.


Values are the specific values applied to the filters, eg. gene names, specific chromosomes etc. The examples below will make this clearer...

First we will retrieve the genomic coordinates of a set of miRNA hairpins that we will use later.

miRsequences <- c("mmu-mir-291b","mmu-mir-292","mmu-mir-293","mmu-mir-294","mmu-mir-295")
miRNAs <- getBM(
      attributes = c("chromosome_name", "start_position", "end_position", "strand","mirbase_id","ensembl_gene_id"), 
      filters = "mirbase_id", values = miRsequences, mart = database)
   chromosome_name start_position end_position strand   mirbase_id    ensembl_gene_id
1                7        3219482      3219560      1 mmu-mir-291b ENSMUSG00000078032
2     MG4151_PATCH        3218156      3218234      1 mmu-mir-291b ENSMUSG00000097568
3                7        3219189      3219270      1  mmu-mir-292 ENSMUSG00000078041
4     MG4151_PATCH        3217863      3217944      1  mmu-mir-292 ENSMUSG00000097782
5                7        3220343      3220422      1  mmu-mir-293 ENSMUSG00000078035
6     MG4151_PATCH        3219017      3219096      1  mmu-mir-293 ENSMUSG00000097688
7                7        3220641      3220724      1  mmu-mir-294 ENSMUSG00000077903
8     MG4151_PATCH        3219315      3219398      1  mmu-mir-294 ENSMUSG00000097034
9                7        3220773      3220841      1  mmu-mir-295 ENSMUSG00000077886
10    MG4151_PATCH        3219447      3219515      1  mmu-mir-295 ENSMUSG00000097400

We'll save this data for use in the second session.

save(miRNAs, file="miRNA_cluster.RData")

It is also possible to find features associated with a specified genomic position. For example, if you have a SNP of interest and know its genomic coordinates you can look for overlapping features.

thefeatures <- getBM(
      attributes = c("chromosome_name", "start_position", "end_position", "strand","external_gene_id","ensembl_gene_id"), 
      filters = c("chromosomal_region"), values = c("16:18255882:18255882"), mart = database)
   chromosome_name start_position end_position strand external_gene_id    ensembl_gene_id
1               16       18253948     18289249     -1            Dgcr8 ENSMUSG00000022718

Finally, it is also possible to derive sequences from the database.

                                                                             gene_exon   mirbase_id

These are just a few examples of how you can use biomaRt to retrieve data for use in your work. However, biomaRt is a very versatile tool and can be applied to many situations. The manual/documentation for this package is very descriptive and contains loads of examples both to access Ensembl and other databases.


If, for some reason, you are unable to access these vignettes on your system, do not despair! All of the documentation is also maintained online:


1. You are interested in the effect of various conditions on the G2/M mitotic cell cycle transition in the mouse (mmusculus). A colleague returns from a conference and has notes from a lecture they feel will interest you. The notes contain the MGI symbols of 5 genes (Tpd52l1, Birc5, Rab6a, Tmem204, Wnt10b). Identify which of these genes is directly associated with the cell cycle. (Hint: The GO ID for the G2/M cell cycle transition is GO:0000086). Check the vignette/manual to find out how to use multiple filters. There are loads of examples here - take a look at 4.3 Task 3)

2. Your boss has published hundreds of papers on Birc5 and is intrigued. After chatting to a collaborator, he suggests searching the 3utr of this gene for protein binding motifs. Retrieve the sequence for all of th 3' UTR sequences of Birc5 so that you can use it in the searches. Hint: Check the getSequence manual for all of the wonderful things it can do.

3. The search for motifs was not fruitful but you are not disheartened. A friend has recently generated a high quality genomic map of epigenetic markers from across the genome. She offers to see if any regions of interest are overlapping the G2/M associated genes. Find the exonic coordinates for these genes. Hint: If you try to do this as demonstrated in the work session it may well break. Try googling the error message and see if you can fix the problem. Alternatively take a look at 7.3 in the biomaRt vignette.

4. Your friend offers to share some very preliminary data with you. She has also generated this map for a series of G2/M phase stalled, mutagenised rat cell lines (rnorvegicus_gene_ensembl). She identifies a region that appears to be consistently differentially effected on chromosome 5 (300000 to 600000) for which the epigenetic markers appear to be altered. Find any genes that overlap this region.

Manipulating genomic coordinates with GenomicRanges

As we saw above it is possible to use biomaRt to identify genomic features that are associated with a particular set of genomic coordinates. However, when handling data derived by next-generation technologies these genomewide comparisons can involve millions of coordinate pairs. The GenomicRanges package helps to simplify what may at first seem like a daunting task.

The data sets

We require a large set of genomic coordinates in order to demonstrate the versatility of the GenomicRanges package. To this end, we will use some small RNA sequencing data downloaded from ArrayExpress (ID: E-GEOD-11724, ENA Accession: SRR023847). This data is derived from a small RNA sequencing experiment performed on mouse embryonic stem cells ("Connecting miRNA Genes to the Core Transcriptional Regulatory Circuitry of Embryonic Stem Cells" Marson et al. Cell 134, 521-533). In principle the sequences within this file will correspond to small RNAs. In order to prepare the data for the course I have trimmed and cleaned the sequences to remove adapter sequences added during sample preparation and from those trimmed sequences I have selected those that are expected to correspond to mature miRNAs, based on length. I have then mapped the reads to the mouse genome sequence and selected those reads that map to a small region on chromosome 7. This provides a large set of genomic coordinates for us to use but the principles below can be applied to any other set of genomic coordinate data.

We will be comparing these genomically mapped sequences to the miRNA coordinates we generated earlier in the biomaRt section above. These are miRNAs that are known to be highly expressed in mouse ES cells.

The format of the aligned reads

The read alignments are represented in SAM and BAM formats. SAM format is a tab delineated, human readable format containing the coordinates of the alignments etc. BAM is the equivalent file but optimised for computers. I have converted the read alignments to BAM format for reading into our R session.

Extract of SAM format:

Please be aware that it is beyond the scope of this session to provide a comprehensive guide to interpreting RNA sequencing data and so a few considerations will be skipped (eg. adjusting counts for reads that map to multiple genomic loci). However, if this interests you, hopefully by the end you will feel more happy working with these Bioconductor packages and be comfortable exploring further.

Loading the data into R

First load the GenomicRanges library


Next load the miRNA coordinates we generated earlier (If you have the same R session open still there is no need to do this). Note that the genomic strand information provided by Ensembl has to be converted to match our alignment file for this analysis.


miRNAs[miRNAs$strand == 1,"strand"] <- "+"
miRNAs[miRNAs$strand == -1,"strand"] <- "-"
miRNAs[miRNAs$strand == 0,"strand"] <- "*"

We also need to remove the chromosome patches for today's analysis.

miRNAs <- miRNAs[miRNAs$chromosome_name == "7",]

Convert the miRNA coordinates into a GRanges object.

miRNAsGR <- GRanges(
   seqnames   = Rle(factor(miRNAs$chromosome_name)),
   ranges     = IRanges(start= miRNAs$start_position, end= miRNAs$end_position),
   strand     = Rle(factor(miRNAs$strand)),
   seqlengths = c("7"=145441459),
   name       = miRNAs$mirbase_id

GRanges with 5 ranges and 1 metadata column:
      seqnames             ranges strand |         name
  [1]        7 [3219482, 3219560]      + | mmu-mir-291b
  [2]        7 [3219189, 3219270]      + |  mmu-mir-292
  [3]        7 [3220343, 3220422]      + |  mmu-mir-293
  [4]        7 [3220641, 3220724]      + |  mmu-mir-294
  [5]        7 [3220773, 3220841]      + |  mmu-mir-295

To load the alignment data we will use a further package; Rsamtools. Functions from this package allow us to interpret and load BAM files, translating them into a format compatible with R.


my_interesting_variables <- c("qname","rname","strand","pos","qwidth")
param <- ScanBamParam(what=my_interesting_variables)
aligned_data <- scanBam("chromosome_7_aligned.bam",param=param)[[1]]

alignments  <- data.frame(aligned_data,stringsAsFactors=FALSE)
colnames(alignments) <- c("Read","Seq","Strand","Start","Width")

               Read Seq Strand   Start Width
1   SRR023847.43288   7      + 3218990    21
2  SRR023847.123657   7      + 3218990    20
3  SRR023847.976020   7      + 3218990    21
4 SRR023847.1172564   7      + 3218990    19
5 SRR023847.2152401   7      + 3218990    19
6 SRR023847.2236490   7      + 3218990    20

Again, the data needs to be converted into a GRanges object.

chr7GR  <-   GRanges(
   seqnames   = Rle(factor(alignments$Seq)),
   ranges     = IRanges(start=alignments$Start, width=alignments$Width),
   strand     = Rle(factor(alignments$Strand)),
   seqlengths = c("7"=145441459),
   name       = alignments$Read

GRanges with 234054 ranges and 1 metadata column:
           seqnames             ranges strand   |              name
       [1]        7 [3218990, 3219010]      +   |   SRR023847.43288
       [2]        7 [3218990, 3219009]      +   |  SRR023847.123657
       [3]        7 [3218990, 3219010]      +   |  SRR023847.976020
       [4]        7 [3218990, 3219008]      +   | SRR023847.1172564
       [5]        7 [3218990, 3219008]      +   | SRR023847.2152401
       ...      ...                ...    ... ...               ...

Finding overlaps between sequence reads and known genomic features

Once we have our data represented as GRanges objects they become easy to manipulate using the specialised functions. First we will find which of our sequence reads overlap our miRNA coordinates.

overlaps <- findOverlaps(query= miRNAsGR,subject=chr7GR, minoverlap = 20)
Hits of length 231012
queryLength: 5
subjectLength: 234054
       queryHits subjectHits 
 1              1       67748 
 2              1       67749 
 3              1       67750 
 4              1       67751 
 5              1       67752 
 ...          ...         ... 
 231008         5      233941 
 231009         5      233942 
 231010         5      233943 
 231011         5      233944 
 231012         5      233945 

We can split the Hits object to reveal which reads are associated with each miRNA

miRNAreads <- split(subjectHits(overlaps),queryHits(overlaps))
names(miRNAreads) <- mcols(miRNAsGR)[as.numeric(names(miRNAreads)),"name"]

[1] "list"
[1] 67748 67749 67750 67751 67752 67753

[1] 22 23 25 26 27 28

[1] 75318 75319 75320 75322 75325 75327

[1] 99021 99022 99023 99024 99025 99026

[1] 152354 152355 152356 152357 152358 152359

Finally, we can replace the read indexes with the read names. Although during small RNA-sequencing analysis it is rarely necessary to associate specific reads with a feature, in other circumstances, such as searching for features associated with specific SNPs etc. displaying the results in this form may be useful.

associatedReads <- lapply(miRNAreads, function(x){mcols(chr7GR)[as.numeric(x),"name"]})

[1] "SRR023847.33928"   "SRR023847.69231"   "SRR023847.496089"  "SRR023847.1073815" "SRR023847.1339362"
[6] "SRR023847.1445575"

[1] "SRR023847.3434161" "SRR023847.52340"   "SRR023847.331166"  "SRR023847.617030"  "SRR023847.646330" 
[6] "SRR023847.984093" 

[1] "SRR023847.6592"  "SRR023847.8644"  "SRR023847.9256"  "SRR023847.10743" "SRR023847.20458"
[6] "SRR023847.25731"

[1] "SRR023847.479662"  "SRR023847.846685"  "SRR023847.867067"  "SRR023847.1087858" "SRR023847.1190599"
[6] "SRR023847.1340549"

[1] "SRR023847.1766634" "SRR023847.2070305" "SRR023847.16483"   "SRR023847.48427"   "SRR023847.65172"  
[6] "SRR023847.80442"  

Measuring expression

When analysing small RNA sequencing data often the goal is to produce an expression matrix for differential expression analysis. In these situations the number of reads mapping to a miRNA locus is often taken as a surrogate for miRNA expression. To this end the countOverlaps function can be used in a fashion similar to findOverlaps to count the number of reads representing each genomic feature. In practice, further considerations are required, such as how to treat multi-mapping reads, however the principles are the same.

miRNACounts <- countOverlaps(query= miRNAsGR,subject=chr7GR, minoverlap = 18)
names(miRNACounts) <- mcols(miRNAsGR)$name
mmu-mir-291b  mmu-mir-292  mmu-mir-293  mmu-mir-294  mmu-mir-295 
        7553        67722        23687        53263        81700 

Visualise sequence coverage of a known feature

Often when analysing next generation sequencing data it is useful to be able to visualise a locus to see exactly which region of a gene is providing the source for the small RNAs. The coverage function reports the number of overlapping ranges that cover each base of our underlying sequence.

favourite <- c("mmu-mir-292")
miRStart <- start(miRNAsGR[mcols(miRNAsGR)$name==favourite])
miREnd <- end(miRNAsGR[mcols(miRNAsGR)$name==favourite])
my_cov <- coverage(chr7GR)[["7"]]
integer-Rle of length 145441459 with 181 runs
  Lengths:   3218989        19         1         1       168 ...         1         1        17 142220602
  Values :         0        11         7         5         0 ...       789         6         1         0

We can plot this as a pretty picture.

plot(as.vector(my_cov)[miRStart:miREnd], type='l', col='red', 
   xaxt="n", ylab = "Coverage",xlab="Chromosomal coordinates", main=favourite)
axis(side = 1, at=, to=(miREnd-miRStart+1), by=10), 
   labels = miRStart, to= miREnd, by=10), cex.axis = 0.5, las=2)

This coverage profile is typical for an expressed miRNA loci. The profile is formed as a consequence of the process by which mature miRNAs are released from a precursor hairpin.