NGSC - FAQs |
Next-Generation Sequencing Core Perelman School of Medicine | University of Pennsylvania |
There is a 'Download Data' link under the 'Results' menu on the website. This leads to the download area for you or your PI. The site is password protected using your NGSC username and password. All files that the NGSC produces in the course of doing your experiment will be available here.
Data is accessed via the HTTP protocol (i.e., any web browser), but can easily be downloaded in bulk using command-line utilities such as wget or curl. In addition, many GUI programs can also download over HTTP. See the section on Data Files below to see how the files are organized.
Our policy is to make all data for a PI available to all of her lab members.
Collaborators can get access to all of a PI's data by setting up an account as a member of the PI's lab (with the permission of the PI, of course!).
If a collaborator should only have access to a limited set of investigations, then the collaborating PI and relevant lab members should set up a set of standard PI and investigator accounts. We will then connect them to the appropriate investigations.
If you have a PMACS HPC account, you can pick up the data on the HPC. The website presents exactly the same files that are on the HPC. To locate files on the HPC, replace the text https://ngsc.med.upenn.edu/Experiments in the URL with /project/ngsc_data/PI_INVESTIGATIONS to get the path on the HPC.
For example
https://ngsc.med.upenn.edu/Experiments/PI/INVESTIGATION/AAA-ForPickUp/FASTQ/
is accessed as
/project/ngsc_data/PI_INVESTIGATIONS/PI/INVESTIGATION/AAA-ForPickUp/FASTQ/
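The prefix substitution described above can be sketched in a few lines of Python. This is only an illustration: the function name url_to_hpc_path is hypothetical, and the two prefixes are copied from the text above.

```python
# Prefixes taken from the FAQ text; the function name is an illustrative assumption.
WEB_PREFIX = "https://ngsc.med.upenn.edu/Experiments"
HPC_PREFIX = "/project/ngsc_data/PI_INVESTIGATIONS"


def url_to_hpc_path(url: str) -> str:
    """Translate an NGSC download URL into the matching path on the HPC."""
    # Accept the bare prefix or anything underneath it, nothing else.
    if url != WEB_PREFIX and not url.startswith(WEB_PREFIX + "/"):
        raise ValueError(f"not an NGSC Experiments URL: {url}")
    # Swap the web prefix for the HPC prefix and keep the rest unchanged.
    return HPC_PREFIX + url[len(WEB_PREFIX):]


print(url_to_hpc_path(
    "https://ngsc.med.upenn.edu/Experiments/PI/INVESTIGATION/AAA-ForPickUp/FASTQ/"
))
```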
Important Note - if you are having the NGSC store files and charge your PMACS HPC account, then our copy of the files will have a group name like ngsc_PINAME_lab. When you get your copy of the files, it is important to use cp to make a copy, then rm our original file. In this way, your copy of the file will have your standard group name. Doing this will simplify accounting for you later on. If you do a mv, the file will retain the ngsc_PINAME_lab group name even though it is under your directory. You will still be charged, but it may be less clear where the charges originate.
At the moment we do not support ftp or Aspera connections for acquiring data.
If you have access to a command line, e.g., in Linux or Mac OS X, you can use the wget command as follows to, for example, download all of your FASTQ files (replace USER and PASSWORD with your username and password, and PI and INVESTIGATION with your PI and the investigation of interest):
wget --user=USER --password=PASSWORD -r -np \
https://ngsc.med.upenn.edu/Experiments/PI/INVESTIGATION/AAA-ForPickUp/FASTQ/
The -r -np options are very important and should not be left out.
-np prevents wget from traveling up the hierarchy, which can reach into the public data on the server (generating a huge download).
-r makes the download recursive, which means that it will download nested folders.
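If you script your downloads, the same wget invocation can be assembled programmatically. The following is a minimal sketch, not an NGSC-provided tool: the function name build_wget_command is hypothetical, and it only constructs and prints the command shown above rather than running it.

```python
import shlex

# Base URL taken from the FAQ text above.
BASE_URL = "https://ngsc.med.upenn.edu/Experiments"


def build_wget_command(user, password, pi, investigation):
    """Return the wget argument list for fetching one investigation's FASTQ files."""
    url = f"{BASE_URL}/{pi}/{investigation}/AAA-ForPickUp/FASTQ/"
    return [
        "wget",
        f"--user={user}",
        f"--password={password}",
        "-r",   # recursive: also download nested folders
        "-np",  # no-parent: never climb above the starting folder
        url,
    ]


cmd = build_wget_command("USER", "PASSWORD", "PI", "INVESTIGATION")
# Print a shell-safe rendering; pass cmd to subprocess.run(cmd, check=True) to execute it.
print(shlex.join(cmd))
```

Keeping the arguments as a list avoids shell-quoting problems when a password contains special characters.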
If you have an account on the PMACS/HPC, you can copy the data from your investigation folder into your own area on PMACS, or scp or rsync it back to your personal or lab machine. Our folders are laid out on the HPC exactly as they appear on the web.
You can contact the PMACS/HPC folks here.
Upon request, the NGSC can push your data to your machine. You will need to provide us with the hostname and account information. We will then scp or rsync your data to that machine.
Runs done in February and March are not available on BaseSpace.
We use Illumina's cloud computing platform BaseSpace to monitor our runs. The full data for MiSeq runs is uploaded to BaseSpace and we can share the run with you if you have a BaseSpace account.
Information about BaseSpace can be found here.
The Download Data button on the website leads to web pages that provide direct access to all of the files for your PI. You will have to log in again using your normal NGSC account username and password. Inside this directory will be one or more folders that correspond to the investigations your PI has with us. Inside each investigation folder is another set of files and folders, which are fairly standardized.
AAA-StudyInfo.xls/.txt
these files describe the samples and any sequencing we have
done for this experiment. They are updated automatically.
AAA-ForPickUp
sequencing data and perhaps analysis data placed here are available for a week after they are placed there. This will be FASTQ files for clients and investigations where the NGSC is not providing any analysis services.
Manual
this folder contains our notes for the investigation as well as custom
programs and small sets of data.
Analysis
this folder contains more advanced standard and custom analyses. For example,
peak calls and motif finding results will be here. More details are given below in the various
Pipeline sections of this document.
IGV
when the NGSC is providing data analysis services, files related to visualization with IGV will be located here.
basic
this folder contains the raw and lightly processed data for the investigation.
This is a temporary working location used when the NGSC provides analysis. Files in here are deleted automatically.
There are two main reasons why an investigation may not appear when you click the Download Data button: either it is an old investigation, or it is part of a collaboration with another PI.
We have moved to a new webserver, which should have both new and old investigations under the same path, Experiments. However, not all old data has been made visible (to reduce our disk footprint). If data is missing, please contact us to determine why.
If you are a collaborator and need to find investigations from other PIs, try replacing your PI's name in the URL with the name of the PI of the investigation.
If neither of these works, please feel free to contact us.
The AAA-StudyInfo files (in text or Excel format) contain a summary of the samples submitted for an investigation and any sequencing done for the samples.
Each sample is listed at least once. Since a sample may be sequenced in multiple lanes, a sample may be listed more than once, but each entry will have a different run or lane.
FASTQ and many other files are named by the run, lane, and barcode. This allows us to easily generate unambiguous names that can easily be linked back to a sample and other run metadata.
Here are the columns in the AAA-StudyInfo files.
INV_name
STUDY_name
ASSAY_name
COND_name
SAMP_id
SAMP_sid
SAMP_name
SAMP_PlateWell
SAMP_type
SAMP_subd
SAMP_Bioa
SAMP_Bad
SAMP_Status
RULA_Run
RULA_Lane
RULA_Run
and RULA_Lane
RULA_Barcode
RULA_Status
RULA_Geometry
Sequencer Model : Sequencer Mode : Sequencer Serial number / run geometry detail / run mode
run geometry detail indicates the exact length of read 1, index 1, index 2, and read 2.
index 2 may be missing if not performed.
read 2 may be missing if not performed.
read 1 and read 2 may include an extra base, i.e., 101 instead of 100, in order to improve the quality of the 100th base.
RULA_RTA
RULA_HCS
TRAK_N
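Because a sample can appear on several rows (one per run and lane), it is often handy to group the rows by sample. The sketch below assumes the .txt version of AAA-StudyInfo is tab-delimited with a header row of the column names listed above; that layout, the function name lanes_by_sample, and the example rows are illustrative assumptions, not a description of the actual file.

```python
import csv
import io
from collections import defaultdict


def lanes_by_sample(handle):
    """Map each SAMP_name to its list of (run, lane, barcode) entries."""
    # Assumption: tab-delimited text with the column names in the first row.
    reader = csv.DictReader(handle, delimiter="\t")
    lanes = defaultdict(list)
    for row in reader:
        lanes[row["SAMP_name"]].append(
            (row["RULA_Run"], row["RULA_Lane"], row["RULA_Barcode"])
        )
    return dict(lanes)


# Hypothetical two-row excerpt: one sample sequenced in two lanes of one run.
example = (
    "SAMP_name\tRULA_Run\tRULA_Lane\tRULA_Barcode\n"
    "sampleA\tFGC0503\t1\tAGGCAGAA\n"
    "sampleA\tFGC0503\t2\tAGGCAGAA\n"
)
result = lanes_by_sample(io.StringIO(example))
print(result)
```

For a real file, open it with open(path, newline="") and pass the handle to the same function.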
The Manual folder contains the notes, commands, and extra data used to process the data for the investigation. We maintain these folders on a separate server in the UPenn Box system for easy sharing, editing, and version control. We then sync them into the website area when there are changes.
The basic folder is no longer considered a useful folder. Data in here is automatically deleted shortly after it is generated.
The basic folder contains files that hold the basic 'raw' sequence data, the results of any trimming of adapter and/or low-quality sequence, and the alignments to one or more genomes. It also contains the visualization data that appears in the TessLA browser for this data, e.g., the USHP tracks. The contents of this directory have evolved over the time the NGSC has been open, so older experiments may have a slightly different set of files than newer ones.
Fastq
- these are the 'raw' data from the sequencer.
Raw is in quotes as the actual output of the sequencer is in BCL format which we convert to FASTQ.
The files are usually (but not always) compressed with gzip.
FASTQ files will be uncompressed during the initial analysis then compressed again later on.
The files are named by the RUN, LANE, END (for paired-end sequencing) and BARCODE
of the data. For example, FGC0503_s_1_1_AGGCAGAA.fastq.gz is the data for run
FGC0503, lane 1, end 1, and barcode AGGCAGAA.
You may occasionally see Undetermined or OTHER instead of a barcode. This
happens when we either do not know the barcodes or there is a discrepancy between
the reported barcodes and what we see in the pool. When OTHER is present, we
have probably split the Undetermined file using the observed barcodes.
fastqc
- Often we run the utility FASTQC to assess the basic statistics of the sequence
generated, looking for adapter, base qualities, CG bias, repeated sequence, etc. The
output of this program consists of both an HTML report and other pictures and text
files.
Bowtie
- output of bowtie aligner including alignments as well as logs and
statistics of the alignments.
BedFiles
- alignments converted to BED or other similar formats for visualization in
the TessLA browser.
BioaResults
- contains the web pages to display BioA results.
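The FASTQ naming scheme described above (RUN, LANE, END, BARCODE) lends itself to simple parsing. The following is a minimal sketch built only from the single example name in this FAQ: the regular expression, the function name parse_fastq_name, and the treatment of END as optional for single-end data are assumptions, so real file names may need a different pattern.

```python
import re

# Assumed shape: RUN_s_LANE[_END]_BARCODE.fastq[.gz]
# END (1 or 2) is treated as optional for single-end data; this is a guess.
FASTQ_NAME = re.compile(
    r"^(?P<run>[^_]+)_s_(?P<lane>\d+)(?:_(?P<end>[12]))?_(?P<barcode>[^.]+)"
    r"\.fastq(?:\.gz)?$"
)


def parse_fastq_name(filename):
    """Split a FASTQ file name into run, lane, end, and barcode (None if no match)."""
    match = FASTQ_NAME.match(filename)
    if match is None:
        return None
    end = match.group("end")
    return {
        "run": match.group("run"),
        "lane": int(match.group("lane")),
        "end": int(end) if end else None,
        # May also be a placeholder such as Undetermined or OTHER.
        "barcode": match.group("barcode"),
    }


print(parse_fastq_name("FGC0503_s_1_1_AGGCAGAA.fastq.gz"))
```

The parsed run, lane, and barcode can then be matched against the RULA_Run, RULA_Lane, and RULA_Barcode columns of the AAA-StudyInfo files to find the corresponding sample.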
In the early years, 2009 to 2011 or so, Illumina's software did not produce FASTQ
files and their alignment program, ELAND, did not produce BED or SAM/BAM files.
Older investigations will therefore have a different set of directories and file
types. The sequence files can be converted and the alignments can be converted or,
better yet, re-aligned to produce data in current formats.
Solexa
This folder contains the sequencing (sequence.txt) and alignment data
(export.txt). Sequence.txt files can be converted to FASTQ for data submission or
updating analyses.
Config
- this can generally be ignored; it contains metadata about sequencing and
samples.
Export
- contains processed alignment data.
The Analysis folder contains all of the secondary analyses for an investigation. It is evolving as we add or alter pipelines.
Files related to ChIP-Seq peak-calling using the HOMER tool.
Files related to RNA-Seq alignment and quantification using the RUM package. Folders generally correspond to individual run/lane/barcodes, perhaps with different alignment protocols.
Files related to BIS-Seq alignment and quantification. Folders generally correspond to individual run/lane/barcodes, perhaps with different alignment protocols. There will also be folders that correspond to samples, since whole-genome sequencing requires a large number of lanes for good coverage.
Statistics about the level of read duplication in samples.
Files that connect genome features, usually ChIP-Seq peaks, to nearby genes.
Files related to differential-expression analysis of RNA-Seq experiments. Each folder corresponds to one or more comparisons with a particular set of data and/or analysis parameters. Often there will be a folder with a preliminary analysis, and another with bad samples left out and/or a specific set of comparisons.