Simply point to the final upload directory output by bcbio, and this function will take care of the rest. It automatically imports RNA-seq counts, metadata, and program versions used.

loadRNASeq(uploadDir, interestingGroups = "sampleName",
  sampleMetadataFile = NULL, annotable, ensemblVersion = NULL,
  transformationLimit = 50, ...)

Arguments

uploadDir

Path to final upload directory. This path is set when running bcbio_nextgen -w template.

interestingGroups

Character vector of interesting groups. First entry is used for plot colors during quality control (QC) analysis. Entire vector is used for PCA and heatmap QC functions.

sampleMetadataFile

Optional. Custom metadata file containing sample information. Otherwise defaults to sample metadata saved in the YAML file.

annotable

Optional. User-defined gene annotations (a.k.a. "annotable"), which will be slotted into rowData(). Typically this should be left undefined. By default, the function will automatically generate an annotable from the annotations available on Ensembl. If set NULL, then rowData() inside the resulting bcbioRNASeq object will be left empty. This is recommended for projects dealing with genes or transcripts that are poorly annotated.

ensemblVersion

Optional. Ensembl release version. If NULL, defaults to current release, and does not typically need to be user-defined. This parameter can be useful for matching Ensembl annotations against an outdated bcbio annotation build.

transformationLimit

Maximum number of samples to calculate DESeq2::rlog() and DESeq2::varianceStabilizingTransformation() matrix. It is not generally recommended to change this value. For large datasets, DESeq2 will take a really long time applying variance stabilization. See Details.

...

Additional arguments, slotted into the metadata() accessor.

Value

bcbioRNASeq.

Details

When number of samples is bigger than transformationLimit, rlog and vst counts will not be slotted into assays(). In this case, we recommend visualization using the tmm counts generated by edgeR.

Note

When working in RStudio, we recommend connecting to the bcbio-nextgen run directory as a remote connection over sshfs.

Examples

uploadDir <- system.file( file.path("extdata", "bcbio"), package = "bcbioRNASeq") bcb <- loadRNASeq(uploadDir, interestingGroups = "group")
#> 2017-05-23_rnaseq
#> 4 samples detected
#> Reading project-summary.yaml
#> Genome: Mus musculus (mm10)
#> Loading Ensembl annotations from AnnotationHub #> 2017-10-27
#> EnsDB AH57770: Mus musculus Ensembl 90
#> Parsed with column specification: #> cols( #> enstxp = col_character(), #> ensgene = col_character() #> )
#> Reading sample metrics
#> Reading bcbio run information
#> Parsed with column specification: #> cols( #> genome = col_character(), #> resource = col_character(), #> version = col_datetime(format = "") #> )
#> Warning: bcbio-nextgen.log missing
#> Warning: bcbio-nextgen-commands.log missing
#> Reading salmon counts using tximport
#> 1
#> Parsed with column specification: #> cols( #> Name = col_character(), #> Length = col_integer(), #> EffectiveLength = col_double(), #> TPM = col_double(), #> NumReads = col_double() #> )
#> 2
#> Parsed with column specification: #> cols( #> Name = col_character(), #> Length = col_integer(), #> EffectiveLength = col_double(), #> TPM = col_double(), #> NumReads = col_double() #> )
#> 3
#> Parsed with column specification: #> cols( #> Name = col_character(), #> Length = col_integer(), #> EffectiveLength = col_double(), #> TPM = col_double(), #> NumReads = col_double() #> )
#> 4
#> Parsed with column specification: #> cols( #> Name = col_character(), #> Length = col_integer(), #> EffectiveLength = col_double(), #> TPM = col_double(), #> NumReads = col_double() #> )
#>
#> summarizing abundance
#> summarizing counts
#> summarizing length
#> Performing trimmed mean of M-values (TMM) normalization
#> Generating internal DESeqDataSet for quality control
#> using just counts from tximport
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
#> Performing rlog transformation
#> Performing variance stabilizing transformation
#> Reading STAR featureCounts aligned counts
#> Parsed with column specification: #> cols( #> .default = col_integer(), #> id = col_character() #> )
#> See spec(...) for full column specifications.
#> Warning: Unannotated genes detected in counts matrix (0.594%)
print(bcb)
#> class: bcbioRNASeq #> dim: 505 4 #> metadata(27): version uploadDir ... devtoolsSessionInfo #> unannotatedGenes #> assays(6): raw normalized ... rlog vst #> rownames(505): ENSMUSG00000002459 ENSMUSG00000004768 ... #> ENSMUSG00000105982 ENSMUSG00000109048 #> rowData names(11): ensgene symbol ... seqCoordSystem entrez #> colnames(4): group1_1 group1_2 group2_1 group2_2 #> colData names(4): sampleID sampleName description group
# Load without gene annotations # Advanced use only! Not generally recommended. bcb <- loadRNASeq(uploadDir, annotable = NULL)
#> 2017-05-23_rnaseq
#> 4 samples detected
#> Reading project-summary.yaml
#> Genome: Mus musculus (mm10)
#> Parsed with column specification: #> cols( #> enstxp = col_character(), #> ensgene = col_character() #> )
#> Reading sample metrics
#> Reading bcbio run information
#> Parsed with column specification: #> cols( #> genome = col_character(), #> resource = col_character(), #> version = col_datetime(format = "") #> )
#> Warning: bcbio-nextgen.log missing
#> Warning: bcbio-nextgen-commands.log missing
#> Reading salmon counts using tximport
#> 1
#> Parsed with column specification: #> cols( #> Name = col_character(), #> Length = col_integer(), #> EffectiveLength = col_double(), #> TPM = col_double(), #> NumReads = col_double() #> )
#> 2
#> Parsed with column specification: #> cols( #> Name = col_character(), #> Length = col_integer(), #> EffectiveLength = col_double(), #> TPM = col_double(), #> NumReads = col_double() #> )
#> 3
#> Parsed with column specification: #> cols( #> Name = col_character(), #> Length = col_integer(), #> EffectiveLength = col_double(), #> TPM = col_double(), #> NumReads = col_double() #> )
#> 4
#> Parsed with column specification: #> cols( #> Name = col_character(), #> Length = col_integer(), #> EffectiveLength = col_double(), #> TPM = col_double(), #> NumReads = col_double() #> )
#>
#> summarizing abundance
#> summarizing counts
#> summarizing length
#> Performing trimmed mean of M-values (TMM) normalization
#> Generating internal DESeqDataSet for quality control
#> using just counts from tximport
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
#> Performing rlog transformation
#> Performing variance stabilizing transformation
#> Reading STAR featureCounts aligned counts
#> Parsed with column specification: #> cols( #> .default = col_integer(), #> id = col_character() #> )
#> See spec(...) for full column specifications.
print(bcb)
#> class: bcbioRNASeq #> dim: 505 4 #> metadata(26): version uploadDir ... utilsSessionInfo #> devtoolsSessionInfo #> assays(6): raw normalized ... rlog vst #> rownames(505): ENSMUSG00000002459 ENSMUSG00000004768 ... #> ENSMUSG00000105982 ENSMUSG00000109048 #> rowData names(0): #> colnames(4): group1_1 group1_2 group2_1 group2_2 #> colData names(4): sampleID sampleName description group
rowData(bcb)
#> DataFrame with 505 rows and 0 columns