NGSC - FAQs |
Next-Generation Sequencing Core Perelman School of Medicine | University of Pennsylvania |
Most of the RNA-Seq differential expression analyses that NGSC does are accomplished in two phases. During the first phase, reads are aligned to the genome and transcripts, and reports are generated which give the raw expression counts for transcripts, introns, exons, and splice junctions. In the second phase, the transcript counts are collected and collated, various quality assessment plots are generated, and finally a set of Excel files is produced which contains the fold changes, p-values and FDRs for the various comparisons in the experiment.
Since these Excel files (and other related files) are the ones you will look at first, this document describes them first. Further down, we describe the files produced by the align phase.
By default the files are created in Analysis/DiffExp/TAG, where TAG is used to distinguish different analyses or different variants of an analysis. In this directory, you may find multiple analyses which use different data and/or parameters. For example, if we find that some of the replicates are bad, we will leave them out and redo the analysis with just the good samples.
In older analyses, you will see 3 to 4 files with names like Compare.*, the most useful of which is Compare.tab.xls. However, Compare.tab.xls turns out to be very large for complex analyses, so more recent versions of the pipeline also write a series of smaller files.
Compare.tab.xls
This file contains the comparison data. The contents are somewhat flexible, but will follow this outline. Each row is a transcript. The first few columns contain the gene, transcript, and 'Best' (an indicator which guides you to the best transcript for each gene). The next set of columns holds the various comparisons. Which comparisons are done depends on the experiment. For each comparison there are 6 columns.
MVA:M:Test:Control = log2 test/control fold change
MVA:A:Test:Control = log2 average expression
EDGE:A:Test:Control = log2 average expression
EDGE:M:Test:Control = log2 test/control fold change
EDGE:pv:Test:Control = 0-1 p-value
EDGE:FDR:Test:Control = 0-1 FDR from p-value using Benjamini-Hochberg correction
Here are some more details about the columns.
MVA is a simple MvA comparison with no statistical significance. We run this for every comparison.
SAMR is run when we have at least three replicates.
The words 'Test' and 'Control' are replaced with the actual conditions used in the comparison, e.g. 'KO' and 'WT'.
The smaller files created by newer versions of the pipeline have names like Compare.tab.1.samr.csv or Compare.tab.1.samr.xls. The 1 indicates which comparison is contained in the file. The comparison tool is either mva, samr, or edger. Both csv and xls files are readable by Excel, but informatics collaborators may prefer the text-based csv files.
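For collaborators who script against these files, the naming scheme above can be split into its parts. The following is a minimal Python sketch, assuming that every per-comparison file follows the Compare.tab.N.TOOL.EXT pattern shown in the examples; the function and class names are hypothetical and are not part of the pipeline.

```python
import re
from typing import NamedTuple, Optional

# Pattern assumed from the examples in the text, e.g. "Compare.tab.1.samr.csv".
_NAME_RE = re.compile(r"^Compare\.tab\.(\d+)\.(mva|samr|edger)\.(csv|xls)$")


class ComparisonFile(NamedTuple):
    index: int       # comparison number (see AAA-Comparison-Catalog.csv)
    tool: str        # "mva", "samr", or "edger"
    extension: str   # "csv" or "xls"


def parse_comparison_filename(name: str) -> Optional[ComparisonFile]:
    """Return the parts of a per-comparison file name, or None if it does not match."""
    match = _NAME_RE.match(name)
    if match is None:
        return None
    return ComparisonFile(int(match.group(1)), match.group(2), match.group(3))


print(parse_comparison_filename("Compare.tab.1.samr.csv"))
print(parse_comparison_filename("Compare.tab.xls"))  # older combined file: no match
```

Returning None for non-matching names lets a script skip the older combined Compare.tab.xls file and any unrelated files in the same directory.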
The file AAA-Comparison-Catalog.csv lists which comparison corresponds to each number.
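Once the right comparison file is identified, the TOOL:STAT:Test:Control column names described earlier can be used to filter transcripts. The sketch below is only an illustration: the 'Gene' header, the KO/WT conditions, the sample rows, and the thresholds (FDR of at most 0.05 and an absolute log2 fold change of at least 1) are all assumptions, not pipeline defaults.

```python
import csv
import io


def split_column(name):
    """Split 'TOOL:STAT:Test:Control' into its four parts, or return None."""
    parts = name.split(":")
    return tuple(parts) if len(parts) == 4 else None


def significant_rows(handle, tool, test, control, max_fdr=0.05, min_abs_m=1.0):
    """Return rows passing the (assumed) FDR and log2 fold-change thresholds."""
    fdr_col = f"{tool}:FDR:{test}:{control}"
    m_col = f"{tool}:M:{test}:{control}"
    hits = []
    for row in csv.DictReader(handle):
        try:
            fdr = float(row[fdr_col])
            m = float(row[m_col])
        except (TypeError, ValueError):
            continue  # skip rows with missing or non-numeric values such as "NA"
        if fdr <= max_fdr and abs(m) >= min_abs_m:
            hits.append(row)
    return hits


# Hypothetical file contents; the "Gene" header and all values are made up.
example = io.StringIO(
    "Gene,EDGE:M:KO:WT,EDGE:FDR:KO:WT\n"
    "GeneA,2.5,0.001\n"
    "GeneB,0.2,0.0001\n"
    "GeneC,-1.8,0.2\n"
    "GeneD,-3.0,NA\n"
)
hits = significant_rows(example, "EDGE", "KO", "WT")
print([row["Gene"] for row in hits])
```

Only GeneA passes both thresholds here: GeneB has too small a fold change, GeneC has too high an FDR, and GeneD has no usable FDR value.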
The all files contain the read counts (reads) and quantile normalized lg read counts (lg.qn) for all samples.
Column names in these files are the same as those in Compare.tab.xls.
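To show what "quantile normalized lg read counts" means in general, here is a small Python sketch of textbook quantile normalization applied to log-transformed counts. It is not the pipeline's code. The log base (2), the pseudocount of 1, the tie handling, and the sample names are all assumptions made for illustration.

```python
import math


def log2_counts(counts, pseudocount=1.0):
    """Log-transform raw counts; the pseudocount (an assumption) avoids log(0)."""
    return [math.log2(c + pseudocount) for c in counts]


def quantile_normalize(columns):
    """Quantile-normalize equal-length columns (one list of values per sample).

    Each value is replaced by the mean, across samples, of the values that
    share its rank. Ties are broken by input order in this simple version.
    """
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all samples must have the same number of values")
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean across samples at each rank.
    reference = [sum(col[i] for col in sorted_cols) / len(columns) for i in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized


# Hypothetical read counts for four transcripts in two samples.
wt = log2_counts([3, 0, 15, 7])    # [2.0, 0.0, 4.0, 3.0]
ko = log2_counts([1, 63, 7, 31])   # [1.0, 6.0, 3.0, 5.0]
wt_qn, ko_qn = quantile_normalize([wt, ko])
print(wt_qn)
print(ko_qn)
```

After normalization both samples contain the same set of values (0.5, 2.5, 4.0, 5.0), assigned according to each transcript's rank within its own sample.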
Within each Comparison folder is another called Heatmap.
We routinely run the pipeline RUM-MultipleComparisons to assess RNA-Seq data. Although the tool includes the word 'RUM' in the title, it can work with gene expression values from a variety of RNA-Seq tools. We are still expanding the analyses that RUM-MultipleComparisons performs, but at the moment it includes these basic steps.
First, take a look at the plot, Replicates and Kmeans-heatmap.pdf files so that you can see if the samples have good intra-condition consistency. In addition, the heatmap file will help you see if the changes between conditions are consistent across samples, and roughly how many sets of expression patterns there are in the set.
Once you can see that the data is ok, turn to the Averages.tab file or the appropriate Kmeans-*-clusters.tab file to see gene IDs. All of the tab files can be opened from within Excel, which can be used to further filter the genes. Gene lists can also be created for use with functional analysis.
We usually focus on well-characterized RefSeqs, i.e., those with IDs like NM_* or NR_*.
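The same focus on NM_* and NR_* IDs can be applied outside Excel. This is a minimal Python sketch, assuming transcript IDs are available as plain strings; the function name and the example IDs are chosen only for illustration.

```python
def is_well_characterized_refseq(transcript_id):
    """True for IDs with the NM_ or NR_ prefixes mentioned above."""
    return transcript_id.startswith(("NM_", "NR_"))


# Illustrative IDs only; they are not taken from any analysis.
ids = ["NM_000546", "XM_011545432", "NR_046018", "ENST00000269305"]
kept = [t for t in ids if is_well_characterized_refseq(t)]
print(kept)
```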
As with most experiments, we initially process RNA-Seq samples individually in a run/lane/barcode (or RLB) phase. We use RUM for this step. Next, we call differentially expressed genes using a custom pipeline described below.
We first align reads to ribosomal sequences and repeats, both to remove these reads and to assess the level of ribosomal sequence in the libraries.
We use the RUM package from Grant et al. to do the basic processing of RNA-Seq data. RUM generates a set of files which we then process a bit further to make them visible in the TessLA browser and usable for other downstream analyses.
Here is a typical set of files produced for a RUM analysis, with typical file sizes in bytes.
Size | File name | Description |
26642352304 | RUM.sam | all alignments |
9094911272 | RUM_NU | non-unique alignments |
48715025 | RUM_NU.bedGraph.gz | bedGraph format for display |
1002296 | RUM_NU.bedGraph.gz.tbi | index of bedGraph format for display |
213897294 | RUM_NU.cov | coverage data for non-uniquely mapping reads |
3687201046 | RUM_Unique | unique alignments |
52517756 | RUM_Unique.bedGraph.gz | bedGraph format for display |
830599 | RUM_Unique.bedGraph.gz.tbi | index of bedGraph format for display |
228549781 | RUM_Unique.cov | coverage data for uniquely mapping reads |
122977785 | feature_quantifications-max.tab | |
122977785 | feature_quantifications-max.tab-sorted | |
122977785 | feature_quantifications-min.tab | |
122977785 | feature_quantifications-min.tab-sorted | |
111522316 | feature_quantifications_RLB-GENOME-TAG | expression levels of transcript, exons, and introns. |
5417514 | inferred_internal_exons.bed | |
3126115 | inferred_internal_exons.txt | |
30203897 | junctions_all.bed | |
30203808 | junctions_all.bed-sorted | |
18500890 | junctions_all.rum | |
9149089 | junctions_high-quality.bed | |
9148994 | junctions_high-quality.bed-sorted | |
16384 | log | |
4142 | mapping_stats.txt | summary of how many reads mapped to genome or transcripts |
289688 | novel_inferred_internal_exons_quantifications_RLB-GENOME-TAG | |
16384 | postproc | |
5438152292 | quals.fa | read qualities |
5438152292 | reads.fa | read sequences |
449 | rum_RLB-GENOME-TAG_preproc.sh | |
848 | rumRLB-GENOME-TAG_proc.sh | |
1837 | rum_job_config | |
3275 | rum_job_report.txt | |
1384 | rum_runner.log | |
125 | rum_sge_job_ids | |
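The bedGraph files listed above can also be summarized directly with a script. The sketch below assumes they are standard four-column bedGraph text (chrom, start, end, value) and that the .gz files can be read with Python's gzip module; the .tbi index suggests bgzip compression, which is gzip-compatible, but this has not been checked against actual NGSC output. All function names are hypothetical.

```python
import gzip


def parse_bedgraph(lines):
    """Yield (chrom, start, end, value) from bedGraph text lines."""
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments, and optional track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, value = line.split()[:4]
        yield chrom, int(start), int(end), float(value)


def covered_bases(intervals, min_depth=1.0):
    """Count bases whose coverage value is at least min_depth."""
    # bedGraph intervals are 0-based and half-open, so the length is end - start.
    return sum(end - start for _, start, end, value in intervals if value >= min_depth)


def covered_bases_in_file(path, min_depth=1.0):
    """Same as covered_bases, reading a gzip-compressed bedGraph file."""
    with gzip.open(path, "rt") as handle:
        return covered_bases(parse_bedgraph(handle), min_depth)


# Hypothetical bedGraph content used in place of a real file.
example = [
    "track type=bedGraph",
    "chr1\t100\t150\t3",
    "chr1\t150\t160\t0",
    "chr2\t0\t25\t12.5",
]
print(covered_bases(parse_bedgraph(example)))
```

For a real file, a call such as covered_bases_in_file("RUM_Unique.bedGraph.gz") would stream the file line by line rather than loading it into memory.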