NGSC - FAQs |
Next-Generation Sequencing Core Perelman School of Medicine | University of Pennsylvania |
The HITS-CLIP (or CLIP-Seq) pipeline is designed to handle a pair of libraries: one enriched for miRNAs (short) and another enriched for RNA fragments (long). Since the enrichment is not perfect, the initial processing of the two libraries is similar. Later, when identifying Ago footprints in RNAs, the long library is more productive. The major steps are described below.
The FASTQ files are located under FASTQ in the investigation folder. The link between samples and FASTQ files is documented in the AAA-StudyInfo.xls (or .csv) file at the top of the investigation folder. FASTQ files are named for the run, lane, and optionally barcode of the sequenced library. This information is in the columns RULA_Run, RULA_Lane, and RULA_Barcode.
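As a rough illustration of the lookup described above, the following sketch reads the .csv version of the study file and collects the run, lane, and barcode for each sample. The column names RULA_Run, RULA_Lane, and RULA_Barcode come from the text; the `Sample` column name, the function name, and the example rows are assumptions made for this example only, and the exact FASTQ file naming pattern is not reproduced here.

```python
import csv
import io


def sample_library_keys(csv_text, sample_column="Sample"):
    """Map each sample name to its (run, lane, barcode) tuple.

    `sample_column` is a hypothetical column name; check the real
    AAA-StudyInfo file for the column that identifies samples.
    """
    keys = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # The barcode is optional, so an empty cell becomes None.
        barcode = (row.get("RULA_Barcode") or "").strip() or None
        keys[row[sample_column]] = (
            row["RULA_Run"].strip(),
            row["RULA_Lane"].strip(),
            barcode,
        )
    return keys


# Made-up rows for illustration only.
EXAMPLE_CSV = (
    "Sample,RULA_Run,RULA_Lane,RULA_Barcode\n"
    "short_lib,R123,4,ACGT\n"
    "long_lib,R123,5,\n"
)

if __name__ == "__main__":
    for sample, key in sample_library_keys(EXAMPLE_CSV).items():
        print(sample, key)
```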
Trimming is done with a program the NGSC wrote (in about 2009, before there were many alternatives). It applies loose matching to a prefix of the supplied adapter sequence. If the adapter matches, the read is truncated at that point. The truncated reads are written to a FASTA (not FASTQ) file located in basic/Trimmed.
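The NGSC trimmer itself is not shown here, but the general idea of loose prefix matching can be sketched as follows. This is a minimal illustration, not the actual program: the prefix length, the mismatch tolerance, and the minimum overlap at the end of the read are all assumed values, and the function name is hypothetical.

```python
def trim_adapter(read, adapter, prefix_len=10, max_mismatches=1, min_overlap=5):
    """Truncate `read` at the first loose match to a prefix of `adapter`.

    All three tuning parameters are assumptions for this sketch.
    """
    probe = adapter[:prefix_len]
    for start in range(len(read)):
        # Near the 3' end the window may be shorter than the probe.
        window = read[start:start + len(probe)]
        if len(window) < min_overlap:
            break
        mismatches = sum(1 for a, b in zip(window, probe) if a != b)
        if mismatches <= max_mismatches:
            return read[:start]
    # No adapter found: the read is returned unchanged.
    return read


if __name__ == "__main__":
    adapter = "TGGAATTCTCGGGTGCCAAGG"
    insert = "TAGCTTATCAGACTGATGTTGA"
    print(trim_adapter(insert + adapter[:13], adapter))
```

A tolerance of one mismatch keeps the sketch simple; a looser setting finds more adapters but also truncates more reads by chance.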
All trimmed reads are aligned with bowtie to the three 'genomes' mentioned above. The bowtie files are located in basic/Bowtie. We output both the bowtie file (*.bowtie) and the non-aligning reads (*.untie.fastq, even though these are FASTA files!). The bowtie files may later be converted to BAM files and indexed for better visualization in the IGV browser, for example.
The following files, under basic/Summary, contain summaries of expression or alignment quality for each run/lane/barcode aligned.
*.hsa_mirna.tab
*.olin.pdf
*.olin.tab
*.olin.pdf
In general the pipeline does not remove duplicate reads. This is because the miRNA reads basically have to be exactly the same. Similarly, since the mRNA reads are defined by RNAse digestion, we expect many of them to have the same end points. Thus it is difficult or impossible to determine if read duplication is due to PCR during library prep or the mechanism that produced the RNA fragment originally.
Footprints are stored under Analysis/Footprints. The footprint process is as follows. We count the number of reads with alignments that start at each position on all RefSeqs. We then build footprints by merging 'nearby' positions, such that weaker positions are merged into stronger ones. Nearby usually means within 10 bp. The assumption is that a given RISC complex binding event can generate a few different pieces of mRNA in the library, depending on where the nuclease cuts the mRNA. Reads from these different versions should be merged together to gauge the strength of the footprint.
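The merging step can be sketched roughly as below for a single transcript. The text only says that weaker positions are merged into nearby stronger ones, so the greedy strongest-first strategy here is an assumption about how that might be done, not the pipeline's actual code; the function name and the output tuple layout are hypothetical.

```python
def merge_footprints(start_counts, nearby=10):
    """Greedily merge read-start positions into footprints.

    start_counts: dict mapping position -> number of reads starting there
                  (for one transcript).
    Returns a list of (start, end, peak, reads) tuples sorted by start.
    """
    footprints = []  # ordered strongest peak first
    # Visit positions from strongest to weakest (ties broken by position).
    ordered = sorted(start_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for pos, count in ordered:
        for fp in footprints:
            if abs(pos - fp["peak"]) <= nearby:
                # Absorb the weaker position into the stronger footprint.
                fp["reads"] += count
                fp["start"] = min(fp["start"], pos)
                fp["end"] = max(fp["end"], pos)
                break
        else:
            # No stronger peak nearby: this position seeds a new footprint.
            footprints.append(
                {"peak": pos, "start": pos, "end": pos, "reads": count}
            )
    return sorted(
        (fp["start"], fp["end"], fp["peak"], fp["reads"]) for fp in footprints
    )


if __name__ == "__main__":
    counts = {100: 50, 104: 10, 95: 5, 200: 20, 205: 3, 150: 1}
    for footprint in merge_footprints(counts):
        print(footprint)
```

Processing positions strongest-first guarantees that every footprint is anchored on its strongest position, which is one simple way to honor the "weaker into stronger" rule.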
The files produced are as follows.
*.starts.bed
*.fp.bed
*.fp.bed+gene: the same as *.fp.bed, but the transcript's gene symbol has been added

As currently implemented, this process can take a long time. With read counts in the low tens of millions the duration is acceptable. For larger data sets, however, it may take too long, so we can limit the number of reads. When a number appears in the file name, it is the number of reads used in the analysis.
The footprint strengths are reported as raw reads. When comparing samples these should be normalized for sequencing or analysis depth by converting to reads per million. This is not done at the moment.
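Since the pipeline does not do this conversion, a user would apply it afterwards. A minimal sketch of the reads-per-million calculation follows; the function name is hypothetical, and `total_reads` is assumed to be the number of reads used in the footprint analysis for that sample.

```python
def reads_per_million(footprint_reads, total_reads):
    """Scale a raw footprint read count to reads per million (RPM)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return footprint_reads * 1_000_000 / total_reads


if __name__ == "__main__":
    # The same raw count is a weaker signal in a more deeply analyzed sample.
    print(reads_per_million(1000, 20_000_000))
    print(reads_per_million(1000, 40_000_000))
```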
It can also be important to normalize footprint strength to the expression level of the transcript or, better yet, of the exon that contains the footprint. A footprint of, say, 1000 reads would be strong on a low-expressing gene but should be considered weaker on a highly expressed transcript. This normalization is not done, as it requires RNA-Seq or other expression data.
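If expression data were available, one simple way to apply such a normalization is to divide the depth-normalized footprint strength by the expression value. The sketch below is only an illustration: the function name, the choice of expression unit, and the pseudocount (used to avoid division by zero) are all assumptions, and other normalization schemes are equally plausible.

```python
def expression_normalized_strength(footprint_rpm, expression, pseudocount=1.0):
    """Divide footprint strength by transcript (or exon) expression.

    `expression` can be in any consistent unit, for example values taken
    from an RNA-Seq experiment; `pseudocount` is an assumed stabilizer.
    """
    if expression < 0:
        raise ValueError("expression must be non-negative")
    return footprint_rpm / (expression + pseudocount)


if __name__ == "__main__":
    # The same footprint strength on a low- and a high-expressing transcript.
    print(expression_normalized_strength(50.0, 4.0))
    print(expression_normalized_strength(50.0, 499.0))
```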
The next natural step in the analysis is to connect miRNAs to footprints by identifying the potential binding miRNAs using seed-sequence matches or other techniques. Due to the unreliability of these techniques at the time, this step is not incorporated into the pipeline.