NGSC Pipeline for HITS-CLIP Data Analysis

Introduction
Materials
Trimming
Alignments
Summaries
Footprints
Normalizing Footprint Strength
Connecting miRNAs to Footprints

1. Introduction

The HITS-CLIP (or CLIP-Seq) pipeline is designed to handle a pair of libraries - one enriched for the miRNAs (short) and another enriched for the RNA fragments (long). Since the enrichment is not perfect, the initial processing of the two libraries is similar. Later, when identifying Ago footprints in RNAs, the long library is more productive. The major steps are

- trimming adapter sequences from the 3' end of the reads
- aligning trimmed reads to the following sets of sequences
  1. miRNA hairpin precursors
  2. RefSeq RNA sequences
  3. whole genome
- plotting the alignment count by length for these three sets
- counting overlaps with mature miRNAs
- grouping alignments on RefSeqs to identify and measure the strength of RNA footprints

2. Materials

The FASTQ files are located under FASTQ in the investigation folder. The link between samples and FASTQ files is documented in the AAA-StudyInfo.xls (or .csv) file at the top of the investigation folder. FASTQ files are named for the run, lane, and optionally barcode of the sequenced library. This information is in the columns RULA_Run, RULA_Lane, and RULA_Barcode.

3. Trimming

Trimming is done with a program the NGSC wrote (in about 2009, before there were many alternatives). It applies loose matching to a prefix of the supplied adapter sequence. If the adapter matches, the read is truncated at that point. The truncated reads are written to a FASTA (not FASTQ) file located in basic/Trimmed.
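
The idea of loose prefix matching can be sketched in Python as follows. This is only an illustration, not the NGSC trimming program: the function name, the prefix length of 8, and the tolerance of one mismatch are all assumptions, and the sketch does not handle an adapter that is cut off at the very end of a read.

```python
def trim_adapter(read, adapter, prefix_len=8, max_mismatches=1):
    """Truncate `read` at the first loose match to a prefix of `adapter`.

    prefix_len and max_mismatches are illustrative values, not the
    settings used by the NGSC trimmer.
    """
    probe = adapter[:prefix_len]
    # Slide the adapter prefix along the read, left to right.
    for start in range(len(read) - len(probe) + 1):
        window = read[start:start + len(probe)]
        mismatches = sum(1 for a, b in zip(window, probe) if a != b)
        if mismatches <= max_mismatches:
            # Adapter found: keep only the bases before it.
            return read[:start]
    # No adapter found: return the read unchanged.
    return read


# Hypothetical read and adapter, for illustration only.
adapter = "TGGAATTCTCGGGTGCC"
trimmed = trim_adapter("ACGTACGTTGGAATTCTCGG", adapter)  # "ACGTACGT"
```

Allowing a mismatch makes the trimmer tolerant of sequencing errors inside the adapter, at the cost of occasionally truncating a read that only resembles the adapter by chance.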

4. Alignments

All trimmed reads are aligned with bowtie to the three 'genomes' mentioned above. The bowtie files are located in basic/Bowtie. We output both the bowtie file (*.bowtie) and the non-aligning reads (*.untie.fastq, even though these are FASTA files!). The bowtie files may later be converted to BAM files and indexed for better visualization, for example in the IGV browser.

5. Summaries

The following files, under basic/Summary, contain summaries of expression or alignment quality for each run/lane/barcode aligned.

*.hsa_mirna.tab
- summary of expression of mature-form miRNAs
- columns - miRNA name, read count, RPM value
*.olin.pdf
- plot of aligned reads by length
- pages
  - 1 read count
  - 2 percent aligning
- what to look for
  - miRNA counts should be well above 50% for short library at around 21 bp
  - genome or RefSeq counts should be high for longer fragments
*.olin.tab
- the raw read counts that are used to make *.olin.pdf
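
The RPM column in *.hsa_mirna.tab is a reads-per-million value. A minimal sketch of that conversion is shown below. The function name is hypothetical, and the denominator is an assumption: the pipeline's actual choice of total (all reads, all aligned reads, or all miRNA reads) is not documented here, so the sketch defaults to the sum of the supplied counts and lets the caller pass a different total.

```python
def reads_per_million(counts, total_reads=None):
    """Convert raw read counts to reads per million (RPM).

    counts: dict mapping miRNA name -> raw read count.
    total_reads: denominator; defaults to the sum of `counts`
    (an assumption, not necessarily what the pipeline uses).
    """
    if total_reads is None:
        total_reads = sum(counts.values())
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return {name: count * 1_000_000 / total_reads
            for name, count in counts.items()}


# Made-up counts, for illustration only.
rpm = reads_per_million({"hsa-miR-21-5p": 750, "hsa-let-7a-5p": 250})
```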

In general, the pipeline does not remove duplicate reads. Reads from the same mature miRNA are expected to be essentially identical, so duplicates are unavoidable there. Similarly, since the mRNA fragments are defined by RNase digestion, we expect many of them to share the same end points. Thus it is difficult or impossible to determine whether read duplication is due to PCR during library prep or to the mechanism that originally produced the RNA fragment.

6. Footprints

Footprints are stored under Analysis/Footprints. The footprint process is as follows. We count the number of reads whose alignments start at each position on every RefSeq. We then build footprints by merging nearby start positions, such that weaker positions are merged into nearby stronger ones. 'Nearby' usually means within 10 bp. The assumption is that a single RISC binding event can generate several different mRNA fragments in the library, depending on where the nuclease cuts the mRNA. Reads from these different fragments should be merged to gauge the strength of the footprint.
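
One way to picture the merging step is the greedy sketch below, written for a single transcript. It is not the pipeline's code. The function name is hypothetical, and several details are assumptions: positions are visited from strongest to weakest, distance is measured to the footprint's strongest position, and ties are broken toward the nearest (then leftmost) footprint.

```python
def merge_starts(starts, window=10):
    """Greedily merge read start positions into footprints.

    starts: dict mapping position -> number of reads starting there
            (one transcript only).
    window: maximum distance in bases for two positions to be 'nearby'.
    Returns a dict mapping footprint position -> total merged reads.
    """
    footprints = {}
    # Visit positions from strongest to weakest so that weak positions
    # are absorbed by stronger neighbours, never the other way round.
    ordered = sorted(starts.items(), key=lambda kv: (-kv[1], kv[0]))
    for pos, count in ordered:
        near = [c for c in footprints if abs(c - pos) <= window]
        if near:
            # Merge into the closest existing footprint.
            center = min(near, key=lambda c: (abs(c - pos), c))
            footprints[center] += count
        else:
            # No stronger position nearby: start a new footprint here.
            footprints[pos] = count
    return footprints


# Made-up start counts, for illustration only.
example = merge_starts({100: 50, 104: 5, 109: 2, 130: 20, 125: 3, 300: 1})
# example == {100: 57, 130: 23, 300: 1}
```

Each footprint keeps the single-base position of its strongest start, which matches the single-base footprint records described for the output files.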

The files produced are as follows.

*.starts.bed
- a BED file with the position and strength of the alignment start positions
- each position is a single base
- the names are just integers from 1 on up
- the score is the number of reads that start at that position
- the strand is always positive since we think only positive alignments make sense
*.fp.bed
- a BED file with the position and strength of the footprints
- each position is still a single base
  - we do not attempt to figure out how long the footprint is
- the names are slightly more informative, but are still automatically generated
- the score is the total number of reads merged into the footprint.
*.fp.bed+gene
- a BED-like file
- columns are as in *.fp.bed, but the transcript's gene symbol has been added
- the file has been sorted in order of decreasing footprint strength

As currently implemented, this process can take a long time. With read counts of a few tens of millions the duration is acceptable. For larger data sets it may take too long, so we can limit the number of reads. When a number appears in the file name, it is the number of reads used in the analysis.

7. Normalizing Footprint Strength

The footprint strengths are reported as raw reads. When comparing samples these should be normalized for sequencing or analysis depth by converting to reads per million. This is not done at the moment.
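
Since the pipeline leaves this step to the user, a minimal sketch of the conversion is given here. It assumes the footprint files follow the standard six-column, tab-separated BED layout with the read count in the fifth (score) column, and that the caller supplies the number of reads used in the analysis. The function name is hypothetical.

```python
def normalize_bed_scores(bed_lines, reads_used):
    """Yield BED lines with the score column converted to reads per million.

    bed_lines: iterable of tab-separated BED lines (score in column 5).
    reads_used: number of reads that went into the footprint analysis.
    """
    if reads_used <= 0:
        raise ValueError("reads_used must be positive")
    for line in bed_lines:
        line = line.rstrip("\n")
        # Pass blank lines and header lines through unchanged.
        if not line or line.startswith(("#", "track", "browser")):
            yield line
            continue
        fields = line.split("\t")
        rpm = float(fields[4]) * 1_000_000 / reads_used
        fields[4] = f"{rpm:.3f}"
        yield "\t".join(fields)


# Made-up footprint record, for illustration only.
raw = ["NM_000001\t99\t100\tfp1\t500\t+"]
normalized = list(normalize_bed_scores(raw, reads_used=20_000_000))
# normalized == ["NM_000001\t99\t100\tfp1\t25.000\t+"]
```

Note that the resulting score is fractional, so the output is BED-like and may not be accepted by tools that expect an integer score.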

It can also be important to normalize footprint strength to the expression level of the transcript or, better yet, of the exon that contains the footprint. The reason is that a footprint of, say, 1000 reads would be a strong footprint on a low-expressing gene, but should be considered weaker if found in a highly expressed transcript. This normalization is not done, as it requires RNA-Seq or other expression data.
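
If expression data are available, the normalization could look like the sketch below. Everything here is an assumption for illustration: the function name, the data layout, the use of a simple ratio, and the minimum-expression cutoff that avoids dividing by values close to zero. Transcript-level values are used, since exon-level expression would need additional annotation.

```python
def normalize_to_expression(footprint_rpm, transcript_expr, min_expr=1.0):
    """Divide footprint strength by the expression of its transcript.

    footprint_rpm: dict mapping (transcript, position) -> footprint RPM.
    transcript_expr: dict mapping transcript -> expression value
                     (for example from RNA-Seq; units are up to the caller).
    min_expr: transcripts below this level are skipped, since the
              ratio would be unstable (illustrative cutoff).
    """
    normalized = {}
    for (transcript, pos), rpm in footprint_rpm.items():
        expr = transcript_expr.get(transcript)
        if expr is None or expr < min_expr:
            continue  # no usable expression estimate for this transcript
        normalized[(transcript, pos)] = rpm / expr
    return normalized


# Made-up values: the same raw strength on a low and a high expresser.
footprints = {("NM_A", 100): 50.0, ("NM_B", 200): 50.0, ("NM_C", 5): 10.0}
expression = {"NM_A": 2.0, "NM_B": 100.0, "NM_C": 0.1}
relative = normalize_to_expression(footprints, expression)
# relative == {("NM_A", 100): 25.0, ("NM_B", 200): 0.5}
```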

8. Connecting miRNAs to Footprints

The next natural step in the analysis is to connect miRNAs to footprints by identifying the potential binding miRNAs using seed-sequence matches or other techniques. Due to the unreliability of these techniques at the time, this step is not incorporated into the pipeline.
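
For readers unfamiliar with the term, a seed-sequence match can be sketched as follows. This is not part of the pipeline and is not a recommendation. It assumes the common convention that the seed is nucleotides 2-8 of the mature miRNA and that a candidate site is an exact reverse complement of the seed in the sequence around a footprint. The function names and the example target sequence are hypothetical.

```python
# Complement table for DNA-alphabet sequences.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def seed_site(mirna, start=2, end=8):
    """Return the target-site sequence (DNA alphabet) for a miRNA seed.

    The seed is taken as nucleotides start..end (1-based, inclusive)
    of the mature miRNA; the site is its reverse complement.
    """
    seed = mirna.upper().replace("U", "T")[start - 1:end]
    return seed.translate(_COMPLEMENT)[::-1]


def find_seed_sites(mirna, target):
    """Return 0-based positions of exact seed sites in `target`."""
    site = seed_site(mirna)
    target = target.upper().replace("U", "T")
    hits = []
    i = target.find(site)
    while i != -1:
        hits.append(i)
        i = target.find(site, i + 1)
    return hits


# Illustrative mature miRNA sequence and a made-up target region.
mirna = "UAGCUUAUCAGACUGAUGUUGA"
hits = find_seed_sites(mirna, "GGGATAAGCTCCCATAAGCTA")
# seed_site(mirna) == "ATAAGCT"; hits == [3, 13]
```

A short exact match like this occurs frequently by chance, which is one reason such predictions were considered unreliable on their own.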