Abstract

RNA-seq analysis involves multiple steps from processing raw sequencing data to identifying, organizing, annotating, and reporting differentially expressed genes. bcbio is an open source, community-maintained framework providing automated and scalable RNA-seq methods for identifying gene abundance counts. We have developed bcbioRNASeq, a Bioconductor package that provides ready-to-render templates and wrapper functions to post-process bcbio output data. bcbioRNASeq automates the generation of high-level RNA-seq reports, including identification of differentially expressed genes, functional enrichment analysis and quality control analysis.

Introduction

For a high-level overview of our bcbio RNA-seq analysis pipeline, including detailed explanation of the bcbioRNASeq S4 class definition, first consult our workflow paper published in F1000 Research (Steinbaugh et al. 2018). This vignette is focused on more advanced usage and edge cases that a user may encounter when attempting to load a bcbio dataset and perform downstream quality control analysis.

Note: if you use bcbioRNASeq in published research, please include this citation:

citation("bcbioRNASeq")
## 
## To cite bcbioRNASeq in publications use:
## 
##   Steinbaugh MJ, Pantano L, Kirchner RD, Barrera V, Chapman BA,
##   Piper ME, Mistry M, Khetani RS, Rutherford KD, Hoffman O,
##   Hutchinson JN, Ho Sui SJ. (2018). bcbioRNASeq: R package for
##   bcbio RNA-seq analysis. F1000Research, 6:1976. URL
##   https://f1000research.com/articles/6-1976/v2. DOI
##   10.12688/f1000research.12093.2.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     title = {{bcbioRNASeq}: {R} package for bcbio {RNA-seq} analysis},
##     author = {Michael J. Steinbaugh and Lorena Pantano and Rory D. Kirchner and Victor Barrera and Brad A. Chapman and Mary E. Piper and Meeta Mistry and Radhika S. Khetani and Kayleigh D. Rutherford and Oliver Hofmann and John N. Hutchinson and Shannan J. Ho Sui},
##     journal = {F1000Research},
##     volume = {6},
##     number = {1976},
##     year = {2018},
##     url = {https://f1000research.com/articles/6-1976/v2},
##     doi = {10.12688/f1000research.12093.2},
##   }

Loading bcbio data

The bcbioRNASeq() constructor function is the main interface connecting bcbio output data to interactive use in R. It is highly customizable and supports a number of options for advanced use cases. Consult the documentation available at help(topic = "bcbioRNASeq", package = "bcbioRNASeq") for additional details.

formals("bcbioRNASeq") %>% str()
## Dotted pair list of 16
##  $ uploadDir         : symbol 
##  $ level             : language c("genes", "transcripts")
##  $ caller            : language c("salmon", "kallisto", "sailfish", "star", "hisat2")
##  $ interestingGroups : chr "sampleName"
##  $ samples           : NULL
##  $ censorSamples     : NULL
##  $ sampleMetadataFile: NULL
##  $ organism          : NULL
##  $ genomeBuild       : NULL
##  $ ensemblRelease    : NULL
##  $ gffFile           : NULL
##  $ transgeneNames    : NULL
##  $ spikeNames        : NULL
##  $ vst               : logi TRUE
##  $ rlog              : logi FALSE
##  $ ...               : symbol

Upload directory

We have designed the constructor to work as simply as possible by default. The only required argument is uploadDir, the path to the bcbio final upload directory specified with upload: in the YAML configuration. Refer to the bcbio configuration documentation for detailed information on how to set up a bcbio run, which is outside the scope of this vignette.

For example, let’s load up the example bcbio dataset stored internally in the package.

upload_dir <- system.file("extdata/bcbio", package = "bcbioRNASeq")
stopifnot(file.exists(upload_dir))
print(upload_dir)
## [1] "/Users/mike/R/library/3.5-bioc-release/library/bcbioRNASeq/extdata/bcbio"

bcbio outputs RNA-seq data in a standardized directory structure, which is described in detail in our workflow paper.

dir(path = upload_dir, full.names = FALSE, recursive = TRUE)
##  [1] "2017-05-23_rnaseq/bcbio-nextgen-commands.log"
##  [2] "2017-05-23_rnaseq/bcbio-nextgen.log"         
##  [3] "2017-05-23_rnaseq/combined.counts"           
##  [4] "2017-05-23_rnaseq/data_versions.csv"         
##  [5] "2017-05-23_rnaseq/programs.txt"              
##  [6] "2017-05-23_rnaseq/project-summary.yaml"      
##  [7] "2017-05-23_rnaseq/tx2gene.csv"               
##  [8] "group1_1/salmon/quant.sf"                    
##  [9] "group1_2/salmon/quant.sf"                    
## [10] "group2_1/salmon/quant.sf"                    
## [11] "group2_2/salmon/quant.sf"                    
## [12] "sample_metadata.csv"

Counts level

By default, bcbioRNASeq imports counts at gene level, which are required for standard differential expression analysis (level = "genes"). For pseudo-aligned counts (e.g. Salmon, Kallisto, Sailfish) (Bray et al. 2016; Patro, Mount, and Kingsford 2014; Patro et al. 2017), tximport (Soneson, Love, and Robinson 2016) is used internally to aggregate transcript-level counts to gene-level counts, and generates length-scaled transcripts per million (TPM) values. For aligned counts processed with featureCounts (Liao, Smyth, and Shi 2014) (e.g. STAR, HISAT2) (Dobin et al. 2013; Dobin and Gingeras 2016; Kim, Langmead, and Salzberg 2015), these values are already returned at gene level, and therefore not handled by tximport. Once the gene-level counts are imported during the bcbioRNASeq() call, the DESeq2 package (Love, Huber, and Anders 2014) is then used to generate an internal DESeqDataSet from which we derive normalized and variance-stabilized counts.

bcb <- bcbioRNASeq(uploadDir = upload_dir, level = "genes")
print(bcb)
## bcbioRNASeq 0.2.6
## class: RangedSummarizedExperiment 
## dim: 502 4 
## metadata(28): version level ... utilsSessionInfo
##   devtoolsSessionInfo
## assays(5): counts tpm length normalized vst
## rownames(502): ENSMUSG00000002459 ENSMUSG00000004768 ...
##   ENSMUSG00000104523 ENSMUSG00000105982
## rowData names(0):
## colnames(4): group1_1 group1_2 group2_1 group2_2
## colData names(21): sampleName description ... x5x3Bias xGC
## ================================================================================
## Upload Dir: "/Users/mike/R/library/3.5-bioc-release/library/bcbioRNASeq/extdata/bcbio"
## Upload Date: 2017-05-23
## R Load Date: 2018-08-23
## Level: "genes"
## Caller: "salmon"
## Organism: character(0)
## Interesting Groups: "sampleName"

Alternatively, if you want to perform transcript-aware analysis, such as differential exon usage or splicing analysis, transcript-level counts can be obtained using level = "transcripts". Note that when counts are loaded at transcript level, TPMs are generated with tximport internally, but no additional normalizations or transformations normally calculated for gene-level counts with DESeq2 are generated.

bcb <- bcbioRNASeq(uploadDir = upload_dir, level = "transcripts")
print(bcb)
## bcbioRNASeq 0.2.6
## class: RangedSummarizedExperiment 
## dim: 996 4 
## metadata(28): version level ... utilsSessionInfo
##   devtoolsSessionInfo
## assays(3): counts tpm length
## rownames(996): ENSMUST00000193812 ENSMUST00000082908 ...
##   ENSMUST00000192656 ENSMUST00000193983
## rowData names(0):
## colnames(4): group1_1 group1_2 group2_1 group2_2
## colData names(21): sampleName description ... x5x3Bias xGC
## ================================================================================
## Upload Dir: "/Users/mike/R/library/3.5-bioc-release/library/bcbioRNASeq/extdata/bcbio"
## Upload Date: 2017-05-23
## R Load Date: 2018-08-23
## Level: "transcripts"
## Caller: "salmon"
## Organism: character(0)
## Interesting Groups: "sampleName"

Expression callers

Since bcbio is flexible and supports a number of expression callers, we have provided advanced options in the bcbioRNASeq() constructor to support a variety of workflows using the caller argument.

Salmon, Kallisto, and Sailfish counts are supported at either gene or transcript level. Internally, these are loaded using tximport.

salmon_genes <- bcbioRNASeq(uploadDir = upload_dir, caller = "salmon")
print(salmon_genes)
## bcbioRNASeq 0.2.6
## class: RangedSummarizedExperiment 
## dim: 502 4 
## metadata(28): version level ... utilsSessionInfo
##   devtoolsSessionInfo
## assays(5): counts tpm length normalized vst
## rownames(502): ENSMUSG00000002459 ENSMUSG00000004768 ...
##   ENSMUSG00000104523 ENSMUSG00000105982
## rowData names(0):
## colnames(4): group1_1 group1_2 group2_1 group2_2
## colData names(21): sampleName description ... x5x3Bias xGC
## ================================================================================
## Upload Dir: "/Users/mike/R/library/3.5-bioc-release/library/bcbioRNASeq/extdata/bcbio"
## Upload Date: 2017-05-23
## R Load Date: 2018-08-23
## Level: "genes"
## Caller: "salmon"
## Organism: character(0)
## Interesting Groups: "sampleName"
assayNames(salmon_genes)
## [1] "counts"     "tpm"        "length"     "normalized" "vst"
salmon_tx <- bcbioRNASeq(uploadDir = upload_dir, level = "transcripts")
print(salmon_tx)
## bcbioRNASeq 0.2.6
## class: RangedSummarizedExperiment 
## dim: 996 4 
## metadata(28): version level ... utilsSessionInfo
##   devtoolsSessionInfo
## assays(3): counts tpm length
## rownames(996): ENSMUST00000193812 ENSMUST00000082908 ...
##   ENSMUST00000192656 ENSMUST00000193983
## rowData names(0):
## colnames(4): group1_1 group1_2 group2_1 group2_2
## colData names(21): sampleName description ... x5x3Bias xGC
## ================================================================================
## Upload Dir: "/Users/mike/R/library/3.5-bioc-release/library/bcbioRNASeq/extdata/bcbio"
## Upload Date: 2017-05-23
## R Load Date: 2018-08-23
## Level: "transcripts"
## Caller: "salmon"
## Organism: character(0)
## Interesting Groups: "sampleName"
assayNames(salmon_tx)
## [1] "counts" "tpm"    "length"

STAR and HISAT2 aligned counts processed with featureCounts are also supported, but only at gene level.

star <- bcbioRNASeq(uploadDir = upload_dir, caller = "star")
print(star)
## bcbioRNASeq 0.2.6
## class: RangedSummarizedExperiment 
## dim: 999 4 
## metadata(27): version level ... utilsSessionInfo
##   devtoolsSessionInfo
## assays(3): counts normalized vst
## rownames(999): ENSMUSG00000102693 ENSMUSG00000064842 ...
##   ENSMUSG00000040865 ENSMUSG00000104991
## rowData names(0):
## colnames(4): group1_1 group1_2 group2_1 group2_2
## colData names(21): sampleName description ... x5x3Bias xGC
## ================================================================================
## Upload Dir: "/Users/mike/R/library/3.5-bioc-release/library/bcbioRNASeq/extdata/bcbio"
## Upload Date: 2017-05-23
## R Load Date: 2018-08-23
## Level: "genes"
## Caller: "star"
## Organism: character(0)
## Interesting Groups: "sampleName"
assayNames(star)
## [1] "counts"     "normalized" "vst"

Sample selection and metadata

If you’d like to load up only a subset of samples, this can be done easily using the samples argument. Note that the character vector declared here must match the description column specified in the sample metadata. For example, let’s create a bcbioRNASeq object containing only the group1_1 and group1_2 samples.

bcb <- bcbioRNASeq(
    uploadDir = upload_dir,
    samples = c("group1_1", "group1_2")
)
print(bcb)
## bcbioRNASeq 0.2.6
## class: RangedSummarizedExperiment 
## dim: 502 2 
## metadata(28): version level ... utilsSessionInfo
##   devtoolsSessionInfo
## assays(5): counts tpm length normalized vst
## rownames(502): ENSMUSG00000002459 ENSMUSG00000004768 ...
##   ENSMUSG00000104523 ENSMUSG00000105982
## rowData names(0):
## colnames(2): group1_1 group1_2
## colData names(21): sampleName description ... x5x3Bias xGC
## ================================================================================
## Upload Dir: "/Users/mike/R/library/3.5-bioc-release/library/bcbioRNASeq/extdata/bcbio"
## Upload Date: 2017-05-23
## R Load Date: 2018-08-23
## Level: "genes"
## Caller: "salmon"
## Organism: character(0)
## Interesting Groups: "sampleName"
sampleNames(bcb)
##   group1_1   group1_2 
## "group1_1" "group1_2"

Conversely, if you’re working with a large dataset and you simply want to drop a few samples, this can be accomplished with the censorSamples argument. Note that the samples argument takes priority if both are declared. For example, let’s drop the group2_2 sample from our minimal dataset.

bcb <- bcbioRNASeq(
    uploadDir = upload_dir,
    censorSamples = "group2_2"
)
print(bcb)
## bcbioRNASeq 0.2.6
## class: RangedSummarizedExperiment 
## dim: 502 3 
## metadata(28): version level ... utilsSessionInfo
##   devtoolsSessionInfo
## assays(5): counts tpm length normalized vst
## rownames(502): ENSMUSG00000002459 ENSMUSG00000004768 ...
##   ENSMUSG00000104523 ENSMUSG00000105982
## rowData names(0):
## colnames(3): group1_1 group1_2 group2_1
## colData names(21): sampleName description ... x5x3Bias xGC
## ================================================================================
## Upload Dir: "/Users/mike/R/library/3.5-bioc-release/library/bcbioRNASeq/extdata/bcbio"
## Upload Date: 2017-05-23
## R Load Date: 2018-08-23
## Level: "genes"
## Caller: "salmon"
## Organism: character(0)
## Interesting Groups: "sampleName"
sampleNames(bcb)
##   group1_1   group1_2   group2_1 
## "group1_1" "group1_2" "group2_1"

If you’re working with a bcbio run that has incorrect or outdated metadata, the simplest way to fix this issue is to pass in new metadata from an external spreadsheet (CSV or Excel) using the sampleMetadataFile argument. Note that this can also be used to subset the bcbio dataset, similar to the samples argument (see above), based on the rows that are included in the spreadsheet.

sample_metadata_file <- file.path(upload_dir, "sample_metadata.csv")
stopifnot(file.exists(sample_metadata_file))
readSampleData(sample_metadata_file)
## Reading sample_metadata.csv
## Demultiplexed samples detected
##          sampleName             fileName description group
## group1_1   group1_1 group1_1_R1.fastq.gz    group1_1  ctrl
## group1_2   group1_2 group1_2_R1.fastq.gz    group1_2  ctrl
## group2_1   group2_1 group2_1_R1.fastq.gz    group2_1    ko
## group2_2   group2_2 group2_2_R1.fastq.gz    group2_2    ko
bcb <- bcbioRNASeq(
    uploadDir = upload_dir,
    sampleMetadataFile = sample_metadata_file
)
print(bcb)
## bcbioRNASeq 0.2.6
## class: RangedSummarizedExperiment 
## dim: 502 4 
## metadata(28): version level ... utilsSessionInfo
##   devtoolsSessionInfo
## assays(5): counts tpm length normalized vst
## rownames(502): ENSMUSG00000002459 ENSMUSG00000004768 ...
##   ENSMUSG00000104523 ENSMUSG00000105982
## rowData names(0):
## colnames(4): group1_1 group1_2 group2_1 group2_2
## colData names(22): sampleName fileName ... x5x3Bias xGC
## ================================================================================
## Upload Dir: "/Users/mike/R/library/3.5-bioc-release/library/bcbioRNASeq/extdata/bcbio"
## Upload Date: 2017-05-23
## R Load Date: 2018-08-23
## Level: "genes"
## Caller: "salmon"
## Organism: character(0)
## Interesting Groups: "sampleName"
## Metadata File: "/Users/mike/R/library/3.5-bioc-release/library/bcbioRNASeq/extdata/bcbio/sample_metadata.csv"
sampleData(bcb, clean = TRUE)
## DataFrame with 4 rows and 2 columns
##          sampleName    group
##            <factor> <factor>
## group1_1   group1_1     ctrl
## group1_2   group1_2     ctrl
## group2_1   group2_1       ko
## group2_2   group2_2       ko

Genome annotations

When analyzing a dataset against a well-annotated genome, we recommend importing the corresponding metadata using AnnotationHub and ensembldb. This functionality is natively supported in the bcbioRNASeq() constructor with using the organism, ensemblRelease, and genomeBuild arguments. This will return a GRanges object using the GenomicRanges package (Lawrence et al. 2013), which contains coordinates and rich metadata for each gene or transcript. These annotations are accessible with the rowRanges() and rowData() functions defined in the SummarizedExperiment package (Huber et al. 2015).

For example, with our internal bcbio dataset, we’re analyzing counts generated against the Ensembl Mus musculus GRCm38 genome build (release 87). These parameters can be defined in the object load call to ensure that the annotations match up exactly with the genome used.

bcb <- bcbioRNASeq(
    uploadDir = upload_dir,
    level = "genes",
    organism = "Mus musculus",
    genomeBuild = "GRCm38",
    ensemblRelease = 87
)
print(bcb)
## bcbioRNASeq 0.2.6
## class: RangedSummarizedExperiment 
## dim: 502 4 
## metadata(29): version level ... utilsSessionInfo
##   devtoolsSessionInfo
## assays(5): counts tpm length normalized vst
## rownames(502): ENSMUSG00000002459 ENSMUSG00000004768 ...
##   ENSMUSG00000104523 ENSMUSG00000105982
## rowData names(7): broadClass description ... geneName
##   seqCoordSystem
## colnames(4): group1_1 group1_2 group2_1 group2_2
## colData names(21): sampleName description ... x5x3Bias xGC
## ================================================================================
## Upload Dir: "/Users/mike/R/library/3.5-bioc-release/library/bcbioRNASeq/extdata/bcbio"
## Upload Date: 2017-05-23
## R Load Date: 2018-08-23
## Level: "genes"
## Caller: "salmon"
## Organism: "Mus musculus"
## Interesting Groups: "sampleName"
## AnnotationHub: "AH53222"
## Ensembl Release: "87"
## Genome Build: "GRCm38"
metadata(bcb)$rowRangesMetadata
## # A tibble: 13 x 2
##    name               value                              
##    <chr>              <chr>                              
##  1 id                 AH53222                            
##  2 Db type            EnsDb                              
##  3 Type of Gene ID    Ensembl Gene ID                    
##  4 Supporting package ensembldb                          
##  5 Db created by      ensembldb package from Bioconductor
##  6 script_version     0.3.1                              
##  7 Creation time      Fri Jun  9 08:40:26 2017           
##  8 ensembl_version    87                                 
##  9 ensembl_host       localhost                          
## 10 Organism           mus_musculus                       
## 11 taxonomy_id        10090                              
## 12 genome_build       GRCm38                             
## 13 DBSCHEMAVERSION    2.1
rowRanges(bcb) %>% as.data.frame() %>% glimpse()
## Observations: 502
## Variables: 12
## $ seqnames       <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ start          <int> 4909576, 33719887, 13139105, 12692277, 18115191...
## $ end            <int> 5070285, 33742564, 13374083, 12861192, 18145902...
## $ width          <int> 160710, 22678, 234979, 168916, 30712, 39300, 18...
## $ strand         <fct> -, +, -, +, -, +, +, +, -, -, +, +, +, -, +, -,...
## $ broadClass     <fct> coding, coding, coding, coding, coding, coding,...
## $ description    <fct> regulator of G-protein signaling 20 [Source:MGI...
## $ entrezID       <I(list)>  58175,  19335,  17978, 240725,  78081,  83...
## $ geneBiotype    <fct> protein_coding, protein_coding, protein_coding,...
## $ geneID         <chr> "ENSMUSG00000002459", "ENSMUSG00000004768", "EN...
## $ geneName       <fct> Rgs20, Rab23, Ncoa2, Sulf1, Crisp4, Crispld1, G...
## $ seqCoordSystem <fct> chromosome, chromosome, chromosome, chromosome,...

Alternatively, transcript-level annotations can also be obtained automatically using this method.

bcb <- bcbioRNASeq(
    uploadDir = upload_dir,
    level = "transcripts",
    organism = "Mus musculus",
    genomeBuild = "GRCm38",
    ensemblRelease = 87
)
print(bcb)
## bcbioRNASeq 0.2.6
## class: RangedSummarizedExperiment 
## dim: 996 4 
## metadata(29): version level ... utilsSessionInfo
##   devtoolsSessionInfo
## assays(3): counts tpm length
## rownames(996): ENSMUST00000193812 ENSMUST00000082908 ...
##   ENSMUST00000192656 ENSMUST00000193983
## rowData names(13): broadClass description ... transcriptName
##   transcriptSupportLevel
## colnames(4): group1_1 group1_2 group2_1 group2_2
## colData names(21): sampleName description ... x5x3Bias xGC
## ================================================================================
## Upload Dir: "/Users/mike/R/library/3.5-bioc-release/library/bcbioRNASeq/extdata/bcbio"
## Upload Date: 2017-05-23
## R Load Date: 2018-08-23
## Level: "transcripts"
## Caller: "salmon"
## Organism: "Mus musculus"
## Interesting Groups: "sampleName"
## AnnotationHub: "AH53222"
## Ensembl Release: "87"
## Genome Build: "GRCm38"
rowRanges(bcb) %>% as.data.frame() %>% glimpse()
## Observations: 996
## Variables: 18
## $ seqnames               <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ start                  <int> 3073253, 3102016, 3205901, 3206523, 321...
## $ end                    <int> 3074322, 3102125, 3216344, 3215632, 367...
## $ width                  <int> 1070, 110, 10444, 9110, 457017, 480, 28...
## $ strand                 <fct> +, +, -, -, -, +, -, -, -, +, -, +, -, ...
## $ broadClass             <fct> other, small, other, other, coding, pse...
## $ description            <fct> RIKEN cDNA 4933401J01 gene [Source:MGI ...
## $ entrezID               <I(list)>     NA,     NA, 497097, 497097, 497...
## $ geneBiotype            <fct> TEC, snRNA, protein_coding, protein_cod...
## $ geneID                 <fct> ENSMUSG00000102693, ENSMUSG00000064842,...
## $ geneName               <fct> 4933401J01Rik, Gm26206, Xkr4, Xkr4, Xkr...
## $ seqCoordSystem         <fct> chromosome, chromosome, chromosome, chr...
## $ transcriptBiotype      <fct> TEC, snRNA, processed_transcript, proce...
## $ transcriptCdsSeqEnd    <int> NA, NA, NA, NA, 3671348, NA, NA, NA, NA...
## $ transcriptCdsSeqStart  <int> NA, NA, NA, NA, 3216022, NA, NA, NA, NA...
## $ transcriptID           <chr> "ENSMUST00000193812", "ENSMUST000000829...
## $ transcriptName         <chr> "ENSMUST00000193812", "ENSMUST000000829...
## $ transcriptSupportLevel <int> NA, NA, 1, 1, 1, NA, NA, NA, NA, 3, NA,...

When working with a dataset generated against a poorly-annotated or non-standard genome, we provide a fallback method for loading gene annotations from a general feature format (GFF) file with the gffFile argument. If possible, we recommend providing a general transfer format (GTF) file, which is identical to GFF version 2. GFFv3 is more complicated and non-standard, but Ensembl GFFv3 files are also supported.

If your dataset contains transgenes (e.g. EGFP, TDTOMATO) or spike-ins (e.g. ERCCs), these features can be defined with the transgeneNames and spikeNames arguments, which will automatically populate the rowRanges() slot with placeholder metadata.

We recommend loading up data per genome in its own bcbioRNASeq object when possible, so that rich metadata can be imported easily. In the edge case where you need to look at multiple genomes simultaneously, set organism = NULL, and bcbioRNASeq will skip the gene annotation acquisition step.

bcb <- bcbioRNASeq(uploadDir = upload_dir, organism = NULL)
print(bcb)
## bcbioRNASeq 0.2.6
## class: RangedSummarizedExperiment 
## dim: 502 4 
## metadata(28): version level ... utilsSessionInfo
##   devtoolsSessionInfo
## assays(5): counts tpm length normalized vst
## rownames(502): ENSMUSG00000002459 ENSMUSG00000004768 ...
##   ENSMUSG00000104523 ENSMUSG00000105982
## rowData names(0):
## colnames(4): group1_1 group1_2 group2_1 group2_2
## colData names(21): sampleName description ... x5x3Bias xGC
## ================================================================================
## Upload Dir: "/Users/mike/R/library/3.5-bioc-release/library/bcbioRNASeq/extdata/bcbio"
## Upload Date: 2017-05-23
## R Load Date: 2018-08-23
## Level: "genes"
## Caller: "salmon"
## Organism: character(0)
## Interesting Groups: "sampleName"

Variance stabilization

During the bcbioRNASeq() constructor call, variance stabilizaton of gene-level counts can be calculated automatically, and is recommended. This is performed internally by the DESeq2 package. These transformations will be slotted into assays() as vst and/or rlog matrices. The vst matrix is calculated by DESeq2::varianceStabilizingTransformation() and rlog by DESeq2::rlog(). For large datasets, rlog calculation can take a long time, so we are currently recommending calculation of only vst by default.

bcb <- bcbioRNASeq(
    uploadDir = upload_dir,
    vst = TRUE,
    rlog = TRUE
)
assayNames(bcb)
## [1] "counts"     "tpm"        "length"     "normalized" "vst"       
## [6] "rlog"
counts(bcb, normalized = "vst") %>% summary()
##     group1_1         group1_2         group2_1         group2_2     
##  Min.   : 5.708   Min.   : 5.708   Min.   : 5.708   Min.   : 5.708  
##  1st Qu.: 5.708   1st Qu.: 5.708   1st Qu.: 5.708   1st Qu.: 5.708  
##  Median : 5.708   Median : 5.708   Median : 5.708   Median : 5.708  
##  Mean   : 6.506   Mean   : 6.506   Mean   : 6.518   Mean   : 6.504  
##  3rd Qu.: 6.296   3rd Qu.: 6.260   3rd Qu.: 6.289   3rd Qu.: 6.261  
##  Max.   :17.081   Max.   :17.331   Max.   :17.334   Max.   :17.438
counts(bcb, normalized = "rlog") %>% summary()
##     group1_1         group1_2         group2_1         group2_2     
##  Min.   :-1.726   Min.   :-1.727   Min.   :-1.728   Min.   :-1.727  
##  1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000  
##  Median : 0.000   Median : 0.000   Median : 0.000   Median : 0.000  
##  Mean   : 1.884   Mean   : 1.887   Mean   : 1.888   Mean   : 1.884  
##  3rd Qu.: 2.985   3rd Qu.: 2.969   3rd Qu.: 2.950   3rd Qu.: 2.979  
##  Max.   :17.130   Max.   :17.322   Max.   :17.325   Max.   :17.405
rm(bcb)

Use case dataset

To demonstrate the functionality and configuration of the package, we have taken an experiment from the Gene Expression Omnibus (GEO) public repository of expression data to use as an example use case. The RNA-seq data is from a study of acute kidney injury in a mouse model (GSE65267) (Craciun et al. 2016). The study aims to identify differentially expressed genes in progressive kidney fibrosis and contains samples from mouse kidneys at several time points (n = 3, per time point) after folic acid treatment. From this dataset, we are using a subset of the samples for our use case: before folic acid treatment, and 1, 3, 7 days after treatment.

For the vignette, we are loading a pre-computed version of the example bcbioRNASeq object used in our workflow paper.

loadRemoteData("https://github.com/hbc/bcbioRNASeq/raw/f1000v2/data/bcb.rda")
print(bcb)
## bcbioRNASeq 0.2.4
## class: RangedSummarizedExperiment 
## dim: 51652 12 
## metadata(30): version level ... devtoolsSessionInfo subset
## assays(6): counts tpm ... rlog vst
## rownames(51652): ENSMUSG00000000001 ENSMUSG00000000003 ...
##   ENSMUSG00000114967 ENSMUSG00000114968
## rowData names(7): geneID geneName ... entrezID broadClass
## colnames(12): control_rep1 control_rep2 ... fa_day7_rep2
##   fa_day7_rep3
## colData names(26): sampleName day ... x5x3Bias xGC
## ================================================================================
## Upload Dir: "/n/data1/cores/bcbio/bcbioRNASeq/F1000v2/GSE65267-merged/final"
## Upload Date: 2018-03-18
## R Load Date: 2018-06-01
## Level: "genes"
## Caller: "salmon"
## Organism: "Mus musculus"
## Interesting Groups: "day"
## AnnotationHub: "AH57770"
## Ensembl Release: "90"
## Genome Build: "GRCm38"

Sample metadata

For reference, let’s take a look at the sample metadata. By comparison, using colData() will return all sample-level metadata, including our quality control metrics generated by bcbio. We recommend instead using sampleData() with the clean = TRUE argument in reports, which only returns factor columns of interest.

sampleData(bcb, clean = TRUE)
## DataFrame with 12 rows and 6 columns
##                sampleName      day replicate   strain   tissue  treatment
##                  <factor> <factor>  <factor> <factor> <factor>   <factor>
## control_rep1 control_rep1        0         1   Balb/c   kidney    control
## control_rep2 control_rep2        0         2   Balb/c   kidney    control
## control_rep3 control_rep3        0         3   Balb/c   kidney    control
## fa_day1_rep1 fa_day1_rep1        1         1   Balb/c   kidney folic acid
## fa_day1_rep2 fa_day1_rep2        1         2   Balb/c   kidney folic acid
## ...                   ...      ...       ...      ...      ...        ...
## fa_day3_rep2 fa_day3_rep2        3         2   Balb/c   kidney folic acid
## fa_day3_rep3 fa_day3_rep3        3         3   Balb/c   kidney folic acid
## fa_day7_rep1 fa_day7_rep1        7         1   Balb/c   kidney folic acid
## fa_day7_rep2 fa_day7_rep2        7         2   Balb/c   kidney folic acid
## fa_day7_rep3 fa_day7_rep3        7         3   Balb/c   kidney folic acid

Interesting groups

Groups of interest to be used for coloring and sample grouping in the quality control plots can be defined in the bcbioRNASeq object using the interestingGroups argument in the bcbioRNASeq() constructor call. Assignment method support with the interestingGroups() function is also provided, which can modify the groups of interest after the object has been created.

The interestingGroups definition defaults to sampleName, which is automatically generated from the bcbio description metadata for demultiplexed bulk RNA-seq samples. In this case, the samples will be grouped and colored uniquely.

Interesting groups must be defined using a character vector and refer to column names defined in the colData() slot of the object. Note that the bcbioRNASeq packages uses lower camel case formatting for column names (e.g. “sampleName”), so the interesting groups should be defined using camel case, not snake (e.g. “sample_name”) or dotted case (e.g. “sample.name”).

This approach was inspired by the DESeq2 package, which uses the argument intgroup in some functions, such as plotPCA() for labeling groups of interest. We took this idea and ran with it for a number of our quality control functions, which are described in detail below.

interestingGroups(bcb) <- c("treatment", "day")
interestingGroups(bcb)
## [1] "treatment" "day"

Quality control

Our workflow paper describes our quality control process in detail. In addition, our Quality Control R Markdown template contains notes detailing each metric. Here were are going into more technical detail regarding how to customize the appearance of the plots. Note that all quality control plotting functions inherit the interestingGroups defined inside the bcbioRNASeq object. You can also change this dynamially for each function call using the interestingGroups argument.

Read counts

Mapping rates

Note that the overall mapping rate is relatively low per sample, while the exonic mapping rate is acceptable. High quality samples should have low intronic mapping rate, and high values are indicative of sample degradation and/or contamination.

rRNA mapping rate

Note that the samples here have a high rRNA mapping rate. This can be indicative of the polyA enrichment or ribo depletion protocol not having removed all ribosomal RNA (rRNA) transcripts. This will reduce the number of biologically meaningful reads in the experiment and is best avoided.

5’->3’ bias

RNA-seq data can have specific biases at either the 5’ or 3’ end of sequenced fragments. It is common to see a small amount of bias, especially if polyA enrichment was performed, or if there is any sample degradation. If a large amount of bias is observed here, be sure to analyze the samples with a Bioanalyzer and check the RIN scores.

Gene distributions

plotGeneSaturation(bcb, label = FALSE)

plotGeneSaturation(bcb, label = TRUE)

## 38142 non-zero genes detected
## Applying trimmed mean of M-values (TMM) normalization

plotCountDensity(
    object = bcb,
    interestingGroups = "sampleName",
    style = "line"
)
## 38142 non-zero genes detected
## Applying trimmed mean of M-values (TMM) normalization

Dispersion

The following plot shows the dispersion by mean of normalized counts. We expect the dispersion to decrease as the mean of normalized counts increases.

## Coercing bcbioRNASeq to DESeqDataSet with DESeq2 1.20.0
## converting counts to integer mode
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 3103 genes
## -- DESeq argument 'minReplicatesForReplace' = 7 
## -- original counts are preserved in counts(dds)
## estimating dispersions
## fitting model and testing

Variance stabilization

These plots show the standard deviation of normalized counts using log2(), varianceStabilizingTransform(), rlog(), and tmm() by rank(mean). Note that all counts shown are log2 scale.

Sample similarity

We can visualize sample similarity with principal component analysis (PCA) and hierarchical clustering of sample correlations. These functions support multiple normalization methods with the normalized argument. We recommend using normalized = "vst" by default.

Let’s plot the PCA using our vst counts, the current recommended default.

plotPCA(bcb, normalized = "vst", label = FALSE)
## Using vst counts
## Plotting PCA using 500 genes

plotPCA(bcb, normalized = "vst", label = TRUE)
## Using vst counts
## Plotting PCA using 500 genes

Let’s visualize the rlog counts for comparison. The performance is similar overall, with some slight differences, but the grouping more or less remains the same.

plotPCA(
    object = bcb,
    normalized = "rlog",
    label = FALSE
)
## Using rlog counts
## Plotting PCA using 500 genes

plotPCA(
    object = bcb,
    normalized = "rlog",
    label = TRUE
)
## Using rlog counts
## Plotting PCA using 500 genes

Generally, the varianceStabilizationTransformation() and rlog() functions provide similar performance, as seen here.

plotCorrelationHeatmap(
    bcb,
    method = "pearson",
    normalized = "vst",
    color = viridis::viridis
)
## Using vst counts

plotCorrelationHeatmap(
    bcb,
    method = "spearman",
    normalized = "vst",
    color = viridis::plasma
)
## Using vst counts

plotCorrelationHeatmap(
    object = bcb,
    method = "pearson",
    normalized = "rlog",
    color = viridis::viridis
)
## Using rlog counts

plotCorrelationHeatmap(
    object = bcb,
    method = "spearman",
    normalized = "rlog",
    color = viridis::plasma
)
## Using rlog counts

Saving counts

The counts() function returns the abundance estimates generated by Salmon. Read counts for each sample in the dataset are aggregated into a matrix, in which columns correspond to samples and rows represent genes. Multiple normalized counts matrices are saved in the bcbioRNASeq object, and are accessible with normalized argument:

  • FALSE: Raw counts (default).
  • TRUE: DESeq2 normalized counts.
  • "tpm": Transcripts per million, calculated by tximport.
  • "tmm": Trimmed mean of M-values normalization method.
  • "vst": Variance stabilization transformation.
  • "rlog": Regularized log transformation.
  • "rle": Relative log expression transformation.

We recommend saving the raw counts, normalized counts, TPMs, and variance-stabilized counts to disk both in binary R Data and CSV formats.

raw <- counts(bcb, normalized = FALSE)
normalized <- counts(bcb, normalized = TRUE)
tpm <- counts(bcb, normalized = "tpm")
vst <- counts(bcb, normalized = "vst")
saveData(raw, normalized, tpm, vst, dir = ".")
writeCounts(raw, normalized, tpm, vst, dir = ".")

Differential expression

bcbioRNASeq integrates with DESeq2, which we recommend for differential expression. To prepare our dataset for analysis, we need to coerce the bcbioRNASeq object to a DESeqDataSet. Use the as() function or refer to the DESeq2 documentation for the DESeqDataSet() constructor.

dds <- as(bcb, "DESeqDataSet")
## Coercing bcbioRNASeq to DESeqDataSet with DESeq2 1.20.0
## converting counts to integer mode
print(dds)
## class: DESeqDataSet 
## dim: 51652 12 
## metadata(30): version level ... devtoolsSessionInfo subset
## assays(6): counts tpm ... rlog vst
## rownames(51652): ENSMUSG00000000001 ENSMUSG00000000003 ...
##   ENSMUSG00000114967 ENSMUSG00000114968
## rowData names(7): geneID geneName ... entrezID broadClass
## colnames(12): control_rep1 control_rep2 ... fa_day7_rep2
##   fa_day7_rep3
## colData names(26): sampleName day ... x5x3Bias xGC

Since both bcbioRNASeq and DESeqDataSet classes extend RangedSummarizedExperiment, internally we coerce the original bcbioRNASeq object to a RangedSummarizedExperiment and then use the DESeqDataSet() constructor, which requires a SummarizedExperiment with integer counts.

Here is the source code responsible for handing off the data to DESeq2.

getMethod(
    f = "coerce",
    signature(
        from = "bcbioRNASeq",
        to = "DESeqDataSet"
    )
)
## Method Definition:
## 
## function (from, to = "DESeqDataSet", strict = TRUE) 
## {
##     validObject(from)
##     if (metadata(from)[["level"]] != "genes") {
##         stop("Gene-level counts are required")
##     }
##     message(paste("Coercing bcbioRNASeq to DESeqDataSet with DESeq2", 
##         packageVersion("DESeq2")))
##     rse <- as(from, "RangedSummarizedExperiment")
##     assay(rse) <- round(assay(rse), digits = 0L)
##     to <- DESeqDataSet(se = rse, design = ~1L)
##     interestingGroups(to) <- interestingGroups(from)
##     validObject(to)
##     to
## }
## <bytecode: 0x7fad478ab0a8>
## <environment: namespace:bcbioRNASeq>
## 
## Signatures:
##         from          to            
## target  "bcbioRNASeq" "DESeqDataSet"
## defined "bcbioRNASeq" "DESeqDataSet"

Now you can perform differential expression analysis with DESeq2. The authors of that package have provided a number of detailed, well documented references online:

We also provide a parameterized R Markdown template for standard DESeq2 differential expression, supporting analysis and visualization of multiple contrasts inside a single report. In our workflow paper, we also describe an example LRT analysis, and advanced edge-case templates are available inside our hbcABC package, which is a current work in progress.

Functional analysis

We provide an R Markdown template for gene set enrichment analysis (GSEA) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis (Kanehisa and Goto 2000; Subramanian et al. 2005), which leverages the functionality of the clusterProfiler package (Yu et al. 2012). This workflow is described in detail in our workflow paper, and is outside the scope of this advanced use vignette. For more information on functional analysis, consult the clusterProfiler vignette, which is extremely detailed and thorough.

R session information

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] hexbin_1.27.2               bindrcpp_0.2.2             
##  [3] bcbioRNASeq_0.2.6           DESeq2_1.20.0              
##  [5] bcbioBase_0.4.1             basejump_0.7.2             
##  [7] SummarizedExperiment_1.10.1 DelayedArray_0.6.5         
##  [9] BiocParallel_1.14.2         matrixStats_0.54.0         
## [11] Biobase_2.40.0              GenomicRanges_1.32.6       
## [13] GenomeInfoDb_1.16.0         IRanges_2.14.10            
## [15] S4Vectors_0.18.3            BiocGenerics_0.26.0        
## [17] BiocStyle_2.8.2            
## 
## loaded via a namespace (and not attached):
##   [1] utf8_1.1.4                    R.utils_2.6.0                
##   [3] tidyselect_0.2.4              RSQLite_2.1.1                
##   [5] AnnotationDbi_1.42.1          htmlwidgets_1.2              
##   [7] grid_3.5.1                    munsell_0.5.0                
##   [9] preprocessCore_1.42.0         codetools_0.2-15             
##  [11] withr_2.1.2                   colorspace_1.3-2             
##  [13] BiocInstaller_1.30.0          knitr_1.20                   
##  [15] rstudioapi_0.7                assertive.base_0.0-7         
##  [17] rdrop2_0.8.1                  labeling_0.3                 
##  [19] tximport_1.8.0                GenomeInfoDbData_1.1.0       
##  [21] mnormt_1.5-5                  bit64_0.9-7                  
##  [23] pheatmap_1.0.10               rprojroot_1.3-2              
##  [25] Matrix.utils_0.9.7            xfun_0.3                     
##  [27] R6_2.2.2                      assertive.sets_0.0-3         
##  [29] locfit_1.5-9.1                AnnotationFilter_1.4.0       
##  [31] bitops_1.0-6                  reshape_0.8.7                
##  [33] assertthat_0.2.0              promises_1.0.1               
##  [35] scales_1.0.0                  nnet_7.3-12                  
##  [37] gtable_0.2.0                  affy_1.58.0                  
##  [39] ensembldb_2.4.1               rlang_0.2.2                  
##  [41] clisymbols_1.2.0              MatrixModels_0.4-1           
##  [43] genefilter_1.62.0             GlobalOptions_0.1.0          
##  [45] splines_3.5.1                 rtracklayer_1.40.5           
##  [47] lazyeval_0.2.1                acepack_1.4.1                
##  [49] checkmate_1.8.5               yaml_2.2.0                   
##  [51] reshape2_1.4.3                GenomicFeatures_1.32.2       
##  [53] backports_1.1.2               httpuv_1.4.5                 
##  [55] Hmisc_4.1-1                   tools_3.5.1                  
##  [57] bookdown_0.7                  psych_1.8.4                  
##  [59] logging_0.7-103               ggplot2_3.0.0                
##  [61] affyio_1.50.0                 assertive.strings_0.0-3      
##  [63] RColorBrewer_1.1-2            sessioninfo_1.0.0            
##  [65] Rcpp_0.12.18                  plyr_1.8.4                   
##  [67] base64enc_0.1-3               progress_1.2.0               
##  [69] zlibbioc_1.26.0               purrr_0.2.5                  
##  [71] RCurl_1.95-4.11               prettyunits_1.0.2            
##  [73] rpart_4.1-13                  viridis_0.5.1                
##  [75] pbapply_1.3-4                 GetoptLong_0.1.7             
##  [77] cowplot_0.9.3                 haven_1.1.2                  
##  [79] grr_0.9.5                     ggrepel_0.8.0                
##  [81] cluster_2.0.7-1               fs_1.2.5                     
##  [83] magrittr_1.5                  data.table_1.11.4            
##  [85] assertive.data_0.0-1          openxlsx_4.1.0               
##  [87] SparseM_1.77                  circlize_0.4.4               
##  [89] ProtGenerics_1.12.0           hms_0.4.2                    
##  [91] mime_0.5                      evaluate_0.11                
##  [93] xtable_1.8-2                  XML_3.98-1.16                
##  [95] rio_0.5.10                    readxl_1.1.0                 
##  [97] gridExtra_2.3                 shape_1.4.4                  
##  [99] compiler_3.5.1                biomaRt_2.36.1               
## [101] tibble_1.4.2                  crayon_1.3.4                 
## [103] R.oo_1.22.0                   htmltools_0.3.6              
## [105] later_0.7.3                   Formula_1.2-3                
## [107] tidyr_0.8.1                   geneplotter_1.58.0           
## [109] DBI_1.0.0                     assertive.files_0.0-2        
## [111] ComplexHeatmap_1.18.1         MASS_7.3-50                  
## [113] assertive.numbers_0.0-2       Matrix_1.2-14                
## [115] readr_1.1.1                   cli_1.0.0                    
## [117] vsn_3.48.1                    assertive.types_0.0-3        
## [119] R.methodsS3_1.7.1             bindr_0.1.1                  
## [121] forcats_0.3.0                 pkgconfig_2.0.2              
## [123] pkgdown_1.1.0                 GenomicAlignments_1.16.0     
## [125] foreign_0.8-71                xml2_1.2.0                   
## [127] roxygen2_6.1.0                annotate_1.58.0              
## [129] XVector_0.20.0                stringr_1.3.1                
## [131] digest_0.6.16                 ConsensusClusterPlus_1.44.0  
## [133] assertive.code_0.0-1          Biostrings_2.48.0            
## [135] cellranger_1.1.0              rmarkdown_1.10               
## [137] htmlTable_1.12                edgeR_3.22.3                 
## [139] curl_3.2                      shiny_1.1.0                  
## [141] Rsamtools_1.32.3              commonmark_1.5               
## [143] quantreg_5.36                 rjson_0.2.20                 
## [145] nlme_3.1-137                  jsonlite_1.5                 
## [147] DEGreport_1.16.0              viridisLite_0.3.0            
## [149] fansi_0.3.0                   desc_1.2.0                   
## [151] limma_3.36.2                  pillar_1.3.0                 
## [153] lattice_0.20-35               Nozzle.R1_1.1-1              
## [155] httr_1.3.1                    survival_2.42-6              
## [157] interactiveDisplayBase_1.18.0 glue_1.3.0                   
## [159] zip_1.0.0                     bit_1.1-14                   
## [161] assertive.properties_0.0-4    stringi_1.2.4                
## [163] blob_1.1.1                    AnnotationHub_2.12.0         
## [165] latticeExtra_0.6-28           memoise_1.1.0                
## [167] dplyr_0.7.6

References

The papers and software cited in our workflows are available as a shared library on Paperpile.

Bray, Nicolas L, Harold Pimentel, Páll Melsted, and Lior Pachter. 2016. “Near-Optimal Probabilistic RNA-seq Quantification.” Nat. Biotechnol. 34 (5): 525–27. doi:10.1038/nbt.3519.

Craciun, Florin L, Vanesa Bijol, Amrendra K Ajay, Poornima Rao, Ramya K Kumar, John Hutchinson, Oliver Hofmann, et al. 2016. “RNA Sequencing Identifies Novel Translational Biomarkers of Kidney Fibrosis.” J. Am. Soc. Nephrol. 27 (6): 1702–13. doi:10.1681/ASN.2015020225.

Dobin, Alexander, and Thomas R Gingeras. 2016. “Optimizing RNA-Seq Mapping with STAR.” Methods Mol. Biol. 1415: 245–62. doi:10.1007/978-1-4939-3572-7\_13.

Dobin, Alexander, Carrie A Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R Gingeras. 2013. “STAR: Ultrafast Universal RNA-seq Aligner.” Bioinformatics 29 (1): 15–21. doi:10.1093/bioinformatics/bts635.

Huber, Wolfgang, Vincent J Carey, Robert Gentleman, Simon Anders, Marc Carlson, Benilton S Carvalho, Hector Corrada Bravo, et al. 2015. “Orchestrating High-Throughput Genomic Analysis with Bioconductor.” Nat. Methods 12 (2): 115–21. doi:10.1038/nmeth.3252.

Kanehisa, M, and S Goto. 2000. “KEGG: Kyoto Encyclopedia of Genes and Genomes.” Nucleic Acids Res. 28 (1): 27–30. https://www.ncbi.nlm.nih.gov/pubmed/10592173.

Kim, Daehwan, Ben Langmead, and Steven L Salzberg. 2015. “HISAT: A Fast Spliced Aligner with Low Memory Requirements.” Nat. Methods 12 (4): 357–60. doi:10.1038/nmeth.3317.

Lawrence, Michael, Wolfgang Huber, Hervé Pagès, Patrick Aboyoun, Marc Carlson, Robert Gentleman, Martin T Morgan, and Vincent J Carey. 2013. “Software for Computing and Annotating Genomic Ranges.” PLoS Comput. Biol. 9 (8): e1003118. doi:10.1371/journal.pcbi.1003118.

Liao, Yang, Gordon K Smyth, and Wei Shi. 2014. “FeatureCounts: An Efficient General Purpose Program for Assigning Sequence Reads to Genomic Features.” Bioinformatics 30 (7): 923–30. doi:10.1093/bioinformatics/btt656.

Love, Michael I, Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-seq Data with DESeq2.” Genome Biol. 15 (12): 550. doi:10.1186/s13059-014-0550-8.

Patro, Rob, Geet Duggal, Michael I Love, Rafael A Irizarry, and Carl Kingsford. 2017. “Salmon Provides Fast and Bias-Aware Quantification of Transcript Expression.” Nat. Methods 14 (4): 417–19. doi:10.1038/nmeth.4197.

Patro, Rob, Stephen M Mount, and Carl Kingsford. 2014. “Sailfish Enables Alignment-Free Isoform Quantification from RNA-seq Reads Using Lightweight Algorithms.” Nat. Biotechnol. 32 (5): 462–64. doi:10.1038/nbt.2862.

Soneson, Charlotte, Michael I Love, and Mark D Robinson. 2016. “Differential Analyses for RNA-seq: Transcript-Level Estimates Improve Gene-Level Inferences.” F1000Res. 4 (December). doi:10.12688/f1000research.7563.1.

Steinbaugh, Michael J, Lorena Pantano, Rory D Kirchner, Victor Barrera, Brad A Chapman, Mary E Piper, Meeta Mistry, et al. 2018. “bcbioRNASeq: R Package for Bcbio RNA-seq Analysis.” F1000Res. 6 (June). doi:10.12688/f1000research.12093.2.

Subramanian, Aravind, Pablo Tamayo, Vamsi K Mootha, Sayan Mukherjee, Benjamin L Ebert, Michael A Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proc. Natl. Acad. Sci. U. S. A. 102 (43): 15545–50. doi:10.1073/pnas.0506580102.

Yu, Guangchuang, Li-Gen Wang, Yanyan Han, and Qing-Yu He. 2012. “ClusterProfiler: An R Package for Comparing Biological Themes Among Gene Clusters.” OMICS 16 (5): 284–87. doi:10.1089/omi.2011.0118.