NGSC - FAQS - RNA-Seq Pipelines

RNA-Seq Pipelines

Introduction
Phase 2 - Differential Expression
Phase 1 - Alignment and Quantification

1. Introduction

Most of the RNA-Seq differential expression analysis that NGSC does is accomplished in two phases. During the first phase, reads are aligned to the genome and transcripts, and reports are generated which give the raw expression counts for transcripts, introns, exons, and splice junctions. Then, in the second phase, the transcript counts are collected and collated, various quality assessment plots are generated, and finally a set of Excel files is produced which contains the fold changes, p-values, and FDRs for the various comparisons in the experiment.

Since these Excel files (and other related files) are the ones you will look at first, this document describes them first. Further down, we describe the files produced by the alignment phase.

2. Phase 2 - Differential Expression

Files

By default, the files are created in Analysis/DiffExp/TAG, where TAG is used to distinguish different analyses or different variants of an analysis. In this directory, you may find multiple analyses which use different data and/or parameters. For example, if we find that some of the replicates are bad, we will leave them out and redo the analysis with just the good samples.

In older analyses, you will see three or four files with names like Compare.*, the most useful of which is Compare.tab.xls. However, Compare.tab.xls turns out to be very large for complex analyses, so more recent versions of the pipeline also write a series of smaller files.

`Compare.tab.xls`

This file contains the comparison data. The contents are somewhat flexible but follow this outline. Each row is a transcript. The first few columns contain the gene, transcript, and 'Best' (an indicator which guides you to the best transcript for each gene). The next set of columns contains the comparisons; which comparisons are done depends on the experiment. For each comparison there are 6 columns.

MVA:M:Test:Control = log2 test/control fold change
MVA:A:Test:Control = log2 average expression
EDGE:A:Test:Control = log2 average expression
EDGE:M:Test:Control = log2 test/control fold change
EDGE:pv:Test:Control = 0-1 p-value
EDGE:FDR:Test:Control = 0-1 FDR from p-value using Benjamini-Hochberg correction
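To make the relationship between the p-value and FDR columns concrete, here is a minimal sketch of the Benjamini-Hochberg adjustment in plain Python. It is an illustration of the general procedure, not the pipeline's own code; the function name `benjamini_hochberg` and the example p-values are made up for this sketch.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (FDRs) in input order."""
    n = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(n), key=lambda i: pvalues[i])
    fdr = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        fdr[i] = running_min
    return fdr


# Made-up p-values, for illustration only.
pvals = [0.01, 0.04, 0.03, 0.20]
fdrs = benjamini_hochberg(pvals)
```

Each adjusted value is the raw p-value scaled by the number of tests divided by its rank, then capped so that a smaller p-value never receives a larger FDR than a bigger one.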

Here are some more details about the columns.

The first word in each column title indicates the tool that is used to produce the data in the column.
- MVA is a simple MvA comparison with no statistical significance. We run this for every comparison.
- `EDGE` is the EdgeR package, which performs differential gene expression analysis on RNA-Seq data. We run this for comparisons that have at least two replicates.
- We also use a third tool SAMR when we have at least three replicates.
- The data that is passed to the analysis programs has been quantile normalized.
The second word indicates what type of data is in the column.
- M values are the log2(Test/Control), so M=1 indicates 2-fold increase in expression.
- A values are log2 of the average expression between two conditions. MvA and EdgeR use different units: MVA is usually reads, whereas EdgeR values have been normalized to counts per million.
The `Test` and `Control` words are replaced with the actual conditions used in the comparison, e.g. `KO` and `WT`.
The next set of columns of the file are quantile normalized log2 versions of the 'raw' data for the individual samples.
The last set of columns are the 'raw' data which is usually reads.
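As a rough illustration of how M and A values relate to the underlying expression values, the sketch below computes both for one transcript. It assumes that A is the log2 of the arithmetic mean of the two conditions and adds a pseudocount to avoid taking the log of zero; both choices, and the function name `mva`, are assumptions of this example rather than a description of what the pipeline does internally.

```python
import math


def mva(test, control, pseudocount=1.0):
    """Return (M, A) for one transcript from two expression values.

    The pseudocount is a hypothetical safeguard against log2(0).
    """
    t = test + pseudocount
    c = control + pseudocount
    m = math.log2(t / c)        # log2 fold change, test over control
    a = math.log2((t + c) / 2)  # log2 of the average expression
    return m, a


# 199 reads in the test condition versus 99 in the control.
m, a = mva(test=199, control=99)
```

With these inputs the pseudocount turns the values into 200 and 100, so M is 1, i.e. a 2-fold increase in the test condition.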

Handy Section Files

The smaller files created by newer versions of the pipeline have names like Compare.tab.1.samr.csv or Compare.tab.1.samr.xls. The 1 indicates which comparison is contained in the file. The comparison tool is one of mva, samr, or edger. Both csv and xls files are readable in Excel, but informatics collaborators may prefer the text-based csv files.
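If you need to sort through many of these files in a script, the naming scheme can be picked apart with a regular expression. The following is a small sketch that assumes the names follow exactly the pattern described above; the helper name `parse_section_name` is hypothetical.

```python
import re

# Compare.tab.<comparison number>.<tool>.<extension>
SECTION_NAME = re.compile(r"^Compare\.tab\.(\d+)\.(mva|samr|edger)\.(csv|xls)$")


def parse_section_name(filename):
    """Return (comparison number, tool, extension), or None if no match."""
    match = SECTION_NAME.match(filename)
    if match is None:
        return None
    number, tool, extension = match.groups()
    return int(number), tool, extension


parsed = parse_section_name("Compare.tab.1.samr.csv")
```

Names that do not follow the scheme, such as the older Compare.tab.xls, return None and can be skipped.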

The file AAA-Comparison-Catalog.csv lists which comparison corresponds to each number.

The `all` files contain the read counts (`reads`) and quantile normalized lg read counts (`lg.qn`) for all samples.

Column names in these files are the same as those in Compare.tab.xls.
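Because the csv files are plain text, they are easy to filter outside Excel. The sketch below pulls out rows that pass an FDR and fold-change cutoff using the EdgeR column names described earlier. The `Gene` column header, the cutoffs of 0.05 and 1.0, and the toy rows are assumptions made for this illustration; check the header of your own file before relying on them.

```python
import csv
import io


def significant_rows(handle, test, control, max_fdr=0.05, min_abs_m=1.0):
    """Return rows whose EdgeR FDR and |M| pass the given cutoffs."""
    m_col = f"EDGE:M:{test}:{control}"
    fdr_col = f"EDGE:FDR:{test}:{control}"
    hits = []
    for row in csv.DictReader(handle):
        if float(row[fdr_col]) <= max_fdr and abs(float(row[m_col])) >= min_abs_m:
            hits.append(row)
    return hits


# Toy data standing in for a real section file (values are made up).
toy = io.StringIO(
    "Gene,EDGE:M:KO:WT,EDGE:FDR:KO:WT\n"
    "geneA,2.1,0.001\n"
    "geneB,0.3,0.02\n"
    "geneC,-1.5,0.20\n"
)
hits = significant_rows(toy, "KO", "WT")
```

For a real file, pass an open file handle (for example `open(path, newline="")`) instead of the in-memory toy table.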

Looking Deeper

Within each Comparison folder is another called Heatmap.

Introduction

We routinely run the pipeline RUM-MultipleComparisons to assess RNA-Seq data. Although the tool includes the word 'RUM' in the title, it can work with gene expression values from a variety of RNA-Seq tools.

We are still expanding the analyses that RUM-MultipleComparisons performs, but at the moment it includes these basic steps.

Assembles a table of the raw data
Filters to consider just transcripts
Performs quantile normalization of the values
Runs a series of k-means clusterings of the data and displays the results as heatmaps
Generates MvA plots of averages for all conditions
Generates MvA plots of replicates within a condition
Tabulates fold-changes between average values for all conditions
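To give a feel for the quantile normalization step in the list above, here is a minimal pure-Python sketch. It forces every sample to share the same distribution by replacing each value with the mean of the values holding the same rank across samples. Ties are broken by input order, which is a simplification; the pipeline's own implementation may handle ties and missing values differently, and the function name is hypothetical.

```python
def quantile_normalize(columns):
    """Quantile normalize a list of equal-length columns (one per sample)."""
    n_rows = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Mean of the values that share the same rank across all samples.
    rank_means = [
        sum(col[r] for col in sorted_cols) / len(columns) for r in range(n_rows)
    ]
    normalized = []
    for col in columns:
        # Row indices ordered from the smallest to the largest value.
        order = sorted(range(n_rows), key=lambda i: col[i])
        out = [0.0] * n_rows
        for rank, i in enumerate(order):
            out[i] = rank_means[rank]
        normalized.append(out)
    return normalized


# Two made-up samples with three transcripts each.
samples = [[5, 2, 3], [4, 1, 6]]
result = quantile_normalize(samples)
```

After normalization both samples contain the same set of values (1.5, 3.5, 5.5), assigned according to each transcript's rank within its own sample.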

What Files Should I Look At?

First, take a look at the plots in the Replicates-mva.pdf and Kmeans-heatmap.pdf files so that you can see whether the samples have good intra-condition consistency. In addition, the heatmap file will help you see whether the changes between conditions are consistent across samples, and roughly how many distinct expression patterns there are in the set.

Once you can see that the data is OK, turn to the Averages.tab file or the appropriate Kmeans-*-clusters.tab file to see gene IDs. All of the tab files can be opened from within Excel, which can be used to further filter the genes. Gene lists can also be created for use with functional analysis.

How Do We Usually Run It?

We usually focus on well-characterized RefSeqs, i.e., those with IDs like NM_* or NR_*.
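If you want to apply the same restriction to your own ID lists, a prefix check is enough. This is a minimal sketch; the function name and the example IDs are chosen only for illustration.

```python
def is_curated_refseq(transcript_id):
    """True for RefSeq IDs of the form NM_* (mRNA) or NR_* (non-coding RNA)."""
    return transcript_id.startswith(("NM_", "NR_"))


# Example IDs, for illustration only.
ids = ["NM_000546", "XM_011521", "NR_046018", "ENST00000269305"]
kept = [t for t in ids if is_curated_refseq(t)]
```

Predicted models (XM_*, XR_*) and IDs from other annotation sources are dropped by this filter.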

3. Phase 1 - Alignment and Quantification

Plots

AllPairs-mva.png - a comparison of all samples in the data set.
Kmeans-heatmap.pdf - series of heatmaps using different numbers of clusters. Yellow/white is high expression, red is low.
Pairs.pdf - MvA plots of all condition comparisons
Replicates-mva.pdf - MvA plots of replicates within a condition

Tables of Data

AllTranscriptReadCounts-sql.tab - initial raw data
AllTranscriptReadCounts.tab - data filtered to just transcripts
Averages.tab - averages over conditions with fold-changes for all comparisons
Details-Lg2-Qn.tab - quantile normalized values for individual samples
Kmeans-04-clusters.tab - details of genes in each cluster.
Kmeans-05-clusters.tab
Kmeans-06-clusters.tab
...
Kmeans-28-clusters.tab
Kmeans-29-clusters.tab
Kmeans-30-clusters.tab

Alignment and Expression Values

As with most experiments, we initially process RNA-Seq samples individually in a run/lane/barcode (or RLB) phase. We use RUM for this step. Next, we proceed to call differentially expressed genes using a custom pipeline described below.

Cleaning

We first align reads to ribosomal sequences and repeats, both to remove these reads and to assess the level of ribosomal sequence in the libraries.

RUM

We use the RUM package from Grant et al. to do the basic processing of RNA-Seq data. RUM generates a set of files which we then process a bit further to make them visible in the TessLA browser and usable for other downstream analyses.

Files

Here is a typical set of files produced for a RUM analysis, with typical file sizes in bytes.

Size	File name	Description
26642352304	RUM.sam	all alignments
9094911272	RUM_NU	non-unique alignments
48715025	RUM_NU.bedGraph.gz	bedGraph format for display
1002296	RUM_NU.bedGraph.gz.tbi	index of bedGraph format for display
213897294	RUM_NU.cov	coverage data for non-uniquely mapping reads
3687201046	RUM_Unique	unique alignments
52517756	RUM_Unique.bedGraph.gz	bedGraph format for display
830599	RUM_Unique.bedGraph.gz.tbi	index of bedGraph format for display
228549781	RUM_Unique.cov	coverage data for uniquely mapping reads
122977785	feature_quantifications-max.tab
122977785	feature_quantifications-max.tab-sorted
122977785	feature_quantifications-min.tab
122977785	feature_quantifications-min.tab-sorted
111522316	feature_quantifications_RLB-GENOME-TAG	expression levels of transcripts, exons, and introns
5417514	inferred_internal_exons.bed
3126115	inferred_internal_exons.txt
30203897	junctions_all.bed
30203808	junctions_all.bed-sorted
18500890	junctions_all.rum
9149089	junctions_high-quality.bed
9148994	junctions_high-quality.bed-sorted
16384	log
4142	mapping_stats.txt	summary of how many reads mapped to genome or transcripts
289688	novel_inferred_internal_exons_quantifications_RLB-GENOME-TAG
16384	postproc
5438152292	quals.fa	read qualities
5438152292	reads.fa	read sequences
449	rum_RLB-GENOME-TAG_preproc.sh
848	rum_RLB-GENOME-TAG_proc.sh
1837	rum_job_config
3275	rum_job_report.txt
1384	rum_runner.log
125	rum_sge_job_ids
