RNA-Seq Pipelines

Table Of Contents

  1. Introduction
  2. Phase 2 - Differential Expression
  3. Phase 1 - Alignment and Quantification

1.  Introduction

Most of the RNA-Seq differential expression analyses that NGSC does are accomplished in two phases. During the first phase, reads are aligned to the genome and transcripts, and reports are generated which give the raw expression counts for transcripts, introns, exons, and splice junctions. In the second phase, the transcript counts are collected and collated, various quality assessment plots are generated, and finally a set of Excel files is produced which contain the fold changes, p-values, and FDRs for the various comparisons in the experiment.

Since these Excel files (and other related files) are the ones you will look at first, this document describes them first. The files produced by the alignment phase are described further down.


2.  Phase 2 - Differential Expression

Files

By default, the files are created in Analysis/DiffExp/TAG, where TAG is used to distinguish different analyses or different variants of an analysis. In this directory, you may find multiple analyses which use different data and/or parameters. For example, if we find that some of the replicates are bad, we will leave them out and redo the analysis with just the good samples.

In older analyses, you will see 3 to 4 files with names like Compare.*, the most useful of which is Compare.tab.xls. However, Compare.tab.xls turns out to be very large for complex analyses, so more recent versions of the pipeline also write a series of smaller files.

Compare.tab.xls

This file contains the comparison data. The contents are somewhat flexible, but will follow this outline. Each row is a transcript. The first few columns contain the gene, the transcript, and 'Best' (an indicator which guides you to the best transcript for each gene). The next set of columns contains the various comparisons. Which comparisons are done depends on the experiment. For each comparison there are 6 columns.

Here are some more details about the columns.

Handy Section Files

The smaller files created by newer versions of the pipeline have names like Compare.tab.1.samr.csv or Compare.tab.1.samr.xls. The 1 indicates which comparison is contained in the file. The comparison tool is either mva, samr, or edger. Both csv and xls files are readable in Excel, but informatics collaborators may prefer the text-based csv files.
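The naming scheme above can be picked apart programmatically. The following is a minimal sketch, assuming the names always follow the Compare.tab.NUMBER.TOOL.EXTENSION shape described here; the helper name parse_section_file and the SectionFile type are hypothetical and are not part of the pipeline.

```python
import re
from typing import NamedTuple, Optional


class SectionFile(NamedTuple):
    """Parts of a section file name (hypothetical helper type)."""
    comparison: int   # comparison number, e.g. 1
    tool: str         # mva, samr, or edger
    extension: str    # csv or xls


# Assumed pattern: Compare.tab.<number>.<tool>.<csv|xls>
_SECTION_FILE = re.compile(r"^Compare\.tab\.(\d+)\.(mva|samr|edger)\.(csv|xls)$")


def parse_section_file(name: str) -> Optional[SectionFile]:
    """Return the parts of a section file name, or None if it does not match."""
    match = _SECTION_FILE.match(name)
    if match is None:
        return None
    return SectionFile(int(match.group(1)), match.group(2), match.group(3))
```

For example, parse_section_file("Compare.tab.1.samr.csv") yields comparison 1, tool samr, extension csv, while the older Compare.tab.xls does not match and gives None.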

The file AAA-Comparison-Catalog.csv lists which comparison corresponds to each number.

The all files contain the read counts (reads) and quantile-normalized lg read counts (lg.qn) for all samples.
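To illustrate what a quantile-normalized lg value is, here is a minimal sketch of one common way to compute it. It is not the pipeline's own code: it assumes that lg means log base 2, that a pseudocount of 1 is added before taking the log, and that ties are broken by row order. The function name quantile_normalize_lg is hypothetical.

```python
import math


def quantile_normalize_lg(counts, pseudocount=1.0):
    """Quantile-normalize log2 read counts across samples.

    counts maps each sample name to a list of raw read counts, with
    transcripts in the same order for every sample. The log base and the
    pseudocount are assumptions, not values taken from the pipeline.
    """
    samples = list(counts)
    n = len(counts[samples[0]])
    lg = {s: [math.log2(c + pseudocount) for c in counts[s]] for s in samples}

    # Mean of the k-th smallest value across all samples, for each rank k.
    sorted_cols = [sorted(lg[s]) for s in samples]
    rank_means = [sum(col[k] for col in sorted_cols) / len(samples)
                  for k in range(n)]

    normalized = {}
    for s in samples:
        # Replace each value by the mean for its rank within the sample.
        order = sorted(range(n), key=lambda i: lg[s][i])
        column = [0.0] * n
        for rank, index in enumerate(order):
            column[index] = rank_means[rank]
        normalized[s] = column
    return normalized
```

After this step every sample has the same distribution of values, so differences between samples reflect changes in rank rather than differences in sequencing depth.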

Column names in these files are the same as those in Compare.tab.xls.

Looking Deeper

Within each Comparison folder is another called Heatmap.

Introduction

We routinely run the pipeline RUM-MultipleComparisons to assess RNA-Seq data. Although the tool includes the word 'RUM' in the title, it can work with gene expression values from a variety of RNA-Seq tools.

We are still expanding the analyses that RUM-MultipleComparisons performs, but at the moment it includes these basic steps.

What Files Should I Look At?

First, take a look at the plot, Replicates and Kmeans-heatmap.pdf files so that you can see if the samples have good intra-condition consistency. In addition, the heatmap file will help you see if the changes between conditions are consistent across samples, and roughly how many sets of expression patterns there are in the set.

Once you can see that the data are OK, turn to the Averages.tab file or the appropriate Kmeans-*-clusters.tab file to see gene IDs. All of the tab files can be opened from within Excel, which can be used to further filter the genes. Gene lists can also be created for use with functional analysis.

How Do We Usually Run It?

We usually focus on well-characterized RefSeqs, i.e., those with IDs like NM_* or NR_*.
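The restriction to NM_* and NR_* IDs can be expressed as a simple prefix test. This is only a sketch of the idea; the function names are hypothetical and the pipeline may apply the filter differently.

```python
def is_well_characterized_refseq(transcript_id):
    """True for RefSeq IDs that start with NM_ or NR_."""
    return transcript_id.startswith(("NM_", "NR_"))


def keep_well_characterized(transcript_ids):
    """Return only the NM_* and NR_* IDs, preserving their order."""
    return [t for t in transcript_ids if is_well_characterized_refseq(t)]
```

Predicted models such as XM_* or XR_* IDs are dropped by this test.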


3.  Phase 1 - Alignment and Quantification

Plots

Alignment and Expression Values

As with most experiments, we initially process RNA-Seq samples individually in a run/lane/barcode (or RLB) phase. We use RUM for this step. Next, we proceed to call differentially expressed genes using a custom pipeline described below.

Cleaning

We first align reads to ribosomal sequences and repeats, both to remove these reads and to assess the level of ribosomal sequence in the libraries.

RUM

We use the RUM package from Grant et al. to do the basic processing of RNA-Seq data. RUM generates a set of files which we then process a bit further to make them visible in the TessLA browser and usable for other downstream analyses.

Files

Here is a typical set of files produced for a RUM analysis, with typical file sizes in bytes.

Size (bytes)  File name                               Description
 26642352304  RUM.sam                                 all alignments
  9094911272  RUM_NU                                  non-unique alignments
    48715025  RUM_NU.bedGraph.gz                      bedGraph format for display
     1002296  RUM_NU.bedGraph.gz.tbi                  index of bedGraph format for display
   213897294  RUM_NU.cov                              coverage data for non-uniquely mapping reads
  3687201046  RUM_Unique                              unique alignments
    52517756  RUM_Unique.bedGraph.gz                  bedGraph format for display
      830599  RUM_Unique.bedGraph.gz.tbi              index of bedGraph format for display
   228549781  RUM_Unique.cov                          coverage data for uniquely mapping reads
   122977785  feature_quantifications-max.tab
   122977785  feature_quantifications-max.tab-sorted
   122977785  feature_quantifications-min.tab
   122977785  feature_quantifications-min.tab-sorted
   111522316  feature_quantifications_RLB-GENOME-TAG  expression levels of transcript, exons, and introns.
     5417514  inferred_internal_exons.bed
     3126115  inferred_internal_exons.txt
    30203897  junctions_all.bed
    30203808  junctions_all.bed-sorted
    18500890  junctions_all.rum
     9149089  junctions_high-quality.bed
     9148994  junctions_high-quality.bed-sorted
       16384  log
        4142  mapping_stats.txt                       summary of how many reads mapped to genome or transcripts
      289688  novel_inferred_internal_exons_quantifications_RLB-GENOME-TAG
       16384  postproc
  5438152292  quals.fa                                read qualities
  5438152292  reads.fa                                read sequences
         449  rum_RLB-GENOME-TAG_preproc.sh
         848  rumRLB-GENOME-TAG_proc.sh
        1837  rum_job_config
        3275  rum_job_report.txt
        1384  rum_runner.log
         125  rum_sge_job_ids
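
The bedGraph.gz files listed above can be read outside a browser as well. The sketch below assumes they follow the standard bedGraph layout (chromosome, 0-based start, end, value, separated by whitespace) and are compressed in a gzip-compatible way, which is what the accompanying .tbi index suggests. The function names read_bedgraph and covered_bases are hypothetical.

```python
import gzip


def read_bedgraph(path):
    """Yield (chrom, start, end, value) tuples from a gzipped bedGraph file.

    Assumes the standard four-column bedGraph layout; track, browser,
    and comment lines are skipped.
    """
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.strip() or line.startswith(("track", "browser", "#")):
                continue
            chrom, start, end, value = line.split()[:4]
            yield chrom, int(start), int(end), float(value)


def covered_bases(path):
    """Count the bases that fall in intervals with a value above zero."""
    return sum(end - start
               for _chrom, start, end, value in read_bedgraph(path)
               if value > 0)
```

For example, covered_bases("RUM_Unique.bedGraph.gz") would give the number of bases with any coverage from uniquely mapping reads, provided the file matches the assumed layout.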