NGSC - FAQS - Downloading Data

Accessing Data via Website

Introduction
Access via PMACS HPC
Bulk Downloads
The Download Data Button
Where's My Investigation?
AAA-StudyInfo
The Manual Folder
The basic Folder
The Analysis Folder

1. Introduction

The Basics

There is a 'Download Data' link under the 'Results' menu on the website. It leads to the download area for you and your PI. The site is password protected; log in with your NGSC username and password. All files that the NGSC produces in the course of doing your experiment will be available here.

Data is accessed via HTTP (i.e., with any web browser), but can easily be downloaded in bulk using command line utilities such as wget or curl. Many GUI programs can also download over HTTP. See the section on Data Files below to see how the files are organized.

Data Sharing

Our policy is to make all data for a PI available to all of her lab members.

Collaborators can get access to all of a PI's data by setting up an account as a member of the PI's lab (with the permission of the PI, of course).

If a collaborator should only have access to a limited set of investigations, then the collaborating PI and relevant lab members should set up a set of standard PI and investigator accounts. We will then connect them to the appropriate investigations.

2. Access via PMACS HPC

If you have a PMACS HPC account, you can pick up the data on the HPC. The website presents exactly the same files that are on the HPC. To get the path of a file on the HPC, replace the text https://ngsc.med.upenn.edu/Experiments in the URL with /project/ngsc_data/PI_INVESTIGATIONS.

For example
https://ngsc.med.upenn.edu/Experiments/PI/INVESTIGATION/AAA-ForPickUp/FASTQ/
is accessed as
/project/ngsc_data/PI_INVESTIGATIONS/PI/INVESTIGATION/AAA-ForPickUp/FASTQ/
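
The prefix substitution described above can be sketched in a few lines of Python. This is only an illustration: the function name url_to_hpc_path is made up here, and the two prefixes are copied from the text above.

```python
# Prefixes taken from the text above.
WEB_PREFIX = "https://ngsc.med.upenn.edu/Experiments"
HPC_PREFIX = "/project/ngsc_data/PI_INVESTIGATIONS"


def url_to_hpc_path(url: str) -> str:
    """Translate a download-area URL into the matching path on the HPC.

    Hypothetical helper: it only swaps the leading prefix and leaves the
    rest of the path (PI, investigation, folders) untouched.
    """
    if not url.startswith(WEB_PREFIX):
        raise ValueError(f"not a download-area URL: {url}")
    rest = url[len(WEB_PREFIX):]
    # Reject look-alike prefixes such as .../Experiments-1/...
    if rest and not rest.startswith("/"):
        raise ValueError(f"unexpected path after prefix: {url}")
    return HPC_PREFIX + rest


print(url_to_hpc_path(
    "https://ngsc.med.upenn.edu/Experiments/PI/INVESTIGATION/AAA-ForPickUp/FASTQ/"
))
```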

Important Note - if you are having the NGSC store files and charge your PMACS HPC account, our copy of the files will have a group name like ngsc_PINAME_lab. When you take your copy of the files, use cp to make the copy and then rm our original file, so that your copy has your standard group name. Doing this will simplify accounting for you later on. If you use mv instead, the file will retain the ngsc_PINAME_lab group even though it is under your directory. You will still be charged, but it may be less clear where the charges originate.
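
If you script your pick-up in Python rather than typing cp and rm by hand, the same copy-then-remove pattern might look like the sketch below. The function name copy_then_remove is hypothetical. The sketch assumes that a newly created file gets your own default group, which is the usual behavior but can differ, for example in a setgid directory.

```python
import filecmp
import os
import shutil


def copy_then_remove(src: str, dst: str) -> None:
    """Copy src to dst as a brand-new file, then delete src.

    shutil.copyfile writes the contents into a newly created file, so the
    copy does not inherit the group of the original (unlike a rename/mv
    within the same filesystem).
    """
    shutil.copyfile(src, dst)
    # Only delete the original once the copy is verified byte for byte.
    if not filecmp.cmp(src, dst, shallow=False):
        raise OSError(f"copy of {src} does not match the original")
    os.remove(src)
```

Avoid shutil.move for this purpose: on the same filesystem it renames the file, which behaves like mv and keeps the original group.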

3. Bulk Downloads

At the moment we do not support ftp or Aspera connections for acquiring data.

Via HTTPS

If you have access to a command line, e.g., in Linux or Mac OS X, you can use the wget command as follows to download, for example, all of your FASTQ files. Replace USER and PASSWORD with your username and password, and PI and INVESTIGATION with your PI and the investigation of interest.

wget --user=USER --password=PASSWORD -r -np \
https://ngsc.med.upenn.edu/Experiments/PI/INVESTIGATION/AAA-ForPickUp/FASTQ/

The -r -np options are very important and should not be left out.

-np prevents wget from ascending to the parent directory, which could otherwise reach into the public data on the server and generate a huge download.
-r makes the download recursive, which means that nested folders are downloaded as well.
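
For scripted downloads, the same command can be assembled in Python so that -r and -np are never left out by accident. This is a minimal sketch: the function names and the subdir default are made up for illustration, and only the wget options shown above are used. wget must be installed for download_fastq to work.

```python
import subprocess

BASE_URL = "https://ngsc.med.upenn.edu/Experiments"


def build_wget_command(user, password, pi, investigation,
                       subdir="AAA-ForPickUp/FASTQ"):
    """Return the wget argument list for one investigation folder."""
    url = f"{BASE_URL}/{pi}/{investigation}/{subdir}/"
    return [
        "wget",
        f"--user={user}",
        f"--password={password}",
        "-r",   # recursive: include nested folders
        "-np",  # no parent: never climb above the starting folder
        url,
    ]


def download_fastq(user, password, pi, investigation):
    """Run wget; raises CalledProcessError if wget exits with an error."""
    subprocess.run(build_wget_command(user, password, pi, investigation),
                   check=True)


print(" ".join(build_wget_command("USER", "PASSWORD", "PI", "INVESTIGATION")))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when a password contains special characters.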

Via SSH

If you have an account on the PMACS/HPC, you can copy the data from your investigation folder into your own area on PMACS, or use scp or rsync to transfer it to your personal or lab machine. Our folders are laid out on the HPC exactly as they appear on the web.

You can contact the PMACS/HPC folks here.

Pushing Data

Upon request the NGSC can push your data to your machine. You will need to provide us with the hostname and account information. We will then scp or rsync your data to that machine.

Via Illumina's BaseSpace

Runs done in February and March are not available on BaseSpace.

We use Illumina's cloud computing platform BaseSpace to monitor our runs. The full data for MiSeq runs is uploaded to BaseSpace and we can share the run with you if you have a BaseSpace account.

Information about BaseSpace can be found here.

4. The Download Data Button

The Download Data button on the website leads to web pages that provide direct access to all of the files for your PI. You will have to log in again using your normal NGSC account username and password. Inside this directory will be one or more folders that correspond to the investigations your PI has with us. Inside each investigation folder is another set of files and folders, which are fairly standardized.

AAA-StudyInfo.xls/.txt - these files describe the samples and any sequencing we have done for this experiment. They are updated automatically.
AAA-ForPickUp - sequencing data (and perhaps analysis data) placed here remain available for a week after they are placed there. For clients and investigations where the NGSC is not providing any analysis services, this will be the FASTQ files.
Manual - this folder contains our notes for the investigation as well as custom programs and small sets of data.
Analysis - this folder contains more advanced standard and custom analyses. For example, peak calls and motif finding results will be here. More details are given below in the various Pipeline sections of this document.
IGV - when the NGSC is providing data analysis services, files related to visualization with IGV will be located here.
basic - this folder contains the raw and lightly processed data for the investigation. It is a temporary working location used when the NGSC provides analysis. Files in here are deleted automatically.

5. Where's My Investigation?

There are two main reasons why an investigation may not appear when you click the Download Data button: either it is an old investigation, or it is part of a collaboration with another PI.

If the investigation is old, try making the following change to the URL in your browser: in a URL such as `http://fgc.genomics.upenn.edu/Experiments-2/PI`, replace `Experiments-2` with `Experiments-1`. Also bear in mind that some very old experiments are named using the invoice number, PI, and investigator, so you may have to pick through a few folders to find the one you want.

We have moved to a new webserver, which should have both new and old investigations under the same path, Experiments. However, not all old data has been made visible (to reduce our disk footprint). If data is missing, please contact us to determine why.

If you are a collaborator and need to find investigations from other PIs, try replacing your PI's name in the URL with the name of the PI of the investigation.

If neither of these works, please feel free to contact us.

6. AAA-StudyInfo

The AAA-StudyInfo files (in text or Excel format) contain a summary of the samples submitted for an investigation and any sequencing done for the samples.

Each sample is listed at least once. Since a sample may be sequenced in multiple lanes, a sample may be listed more than once, but each entry will have a different run or lane.

FASTQ and many other files are named by the run, lane, and barcode. This allows us to easily generate unambiguous names that can easily be linked back to a sample and other run metadata.

Here are the columns in the AAA-StudyInfo files.

INV_name
- the name of the investigation
- usually the same for all rows in the file
STUDY_name
- the name of the study (under the investigation)
- most investigations have a single study called 'Main'
- more complex investigations may have a few studies covering the phases or experiment types in the investigation.
ASSAY_name
- this is the type of measurement the sample is associated with.
- it may be as simple as 'RNA-Seq'
- ChIP-Seq assays are usually named after the antibody target
- Argonaute HITS-CLIP assays will usually indicate 'mRNA' or 'miRNA'
COND_name
- this is the name of the experimental condition
- it consists of one or more space-delimited words in alphabetical order
- in a complex experiment there may be multiple words
- each word usually represents a different independent variable
SAMP_id
- NGSC's internal numeric sample id
- these are unique and assigned when the sample is checked into our system
SAMP_sid
- the source id of the sample
- when applicable and properly set, these will allow informatics analyses to link samples from the same donor or source
SAMP_name
- this is the name of the sample, usually as specified by the investigator
- when samples are submitted via a plate the name may include the plate and well name
- well name looks like '[[A1]]' and will be located at the start of the sample name
SAMP_PlateWell
- the plate name and well location
- used when the sample was submitted or transferred to a plate
- blank otherwise.
SAMP_type
- indicates the molecule type
- usually 'RNA', 'DNA', 'cDNA', etc. for non-library samples
- 'RNA-Seq Library', 'ChIP-Seq Library' etc, for library samples
- when the NGSC makes libraries, there will be entries for both the RNA and RNA-Seq library, for example.
SAMP_subd
- effective date the sample was submitted.
SAMP_Bioa
- date when the sample was run on the Bioanalyzer by the NGSC (if that was done)
SAMP_Bad
- 0 - sample is considered to be good (as far as we know)
- 1 - sample is bad.
SAMP_Status
- position of the sample in our queue
RULA_Run
- run number if sample was sequenced.
- for example 'FGC1234'
- RNA and other non-library samples should always have a 'NULL' value in this column.
- a library that has not yet been sequenced will also have a 'NULL' value.
RULA_Lane
- lane of the run if sample was sequenced
- members of a pool will have the same RULA_Run and RULA_Lane
- RNA and other non-library samples should always have a 'NULL' value in this column.
- a library that has not yet been sequenced will also have a 'NULL' value.
RULA_Barcode
- this is the barcode associated with the sample
- non-library samples may have a barcode to indicate how we plan to barcode the libraries.
- library samples will have a barcode even if they have not been sequenced.
RULA_Status
- the status of the run and lane
- '++++' - everything is good.
- 'T' indicates a test run or lane - this will occur in the 1st or 3rd position
- 'F' indicates a failed run or lane - NGSC accepting financial responsibility
- 'I' indicates a failed run or lane - Illumina accepting financial responsibility
- 'U' indicates a failed run or lane - Client (user) accepting financial responsibility
- 'R' indicates the run was rehybed (rehybridized)
- 'S' indicates the lane or run is suspicious but a final status has not been determined.
- failures usually do not produce much if any data, or they may produce damaged data
RULA_Geometry
- has the form - Sequencer Model : Sequencer Mode : Sequencer Serial number / run geometry detail / run mode
- run geometry detail indicates the exact length of read 1, index 1, index 2, and read 2.
  - index 2 may be missing if not performed
  - read 2 may be missing if not performed
  - read 1 and read 2 may include an extra base, i.e., 101 instead of 100, in order to improve the quality of the 100th base.
  - the extra cycle is generally not provided in FASTQs
  - when the run is longer than what was requested for a sample, the extra data is generally not provided.
- Generally MiSeq runs are test runs to assess pool balance and library quality.
- may be needed for submitting data
- Note that the FASTQ files provided for the run may include shorter reads than indicated here.
- for example a ChIP-Seq sample that was supposed to get 50SR may be sequenced on a 100SR run.
RULA_RTA
- the version of the Run Time Analysis software on the sequencer
- may be needed for submitting data
RULA_HCS
- the version of the HiSeq Control software on the sequencer
- may be needed for submitting data
TRAK_N
- number of tracks loaded in the genome browser for this sample
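
As an illustration of how these columns can be used, the sketch below reads an AAA-StudyInfo text file and picks out the sequenced samples. It rests on several assumptions that are not stated above: that the .txt file is tab-delimited with a header row holding the column names listed here, and that unsequenced rows contain the literal text NULL. The sample rows and all function names are invented for the example.

```python
import csv
import io
import re

# 'F', 'I' and 'U' all mark a failed run or lane in RULA_Status.
FAILED_CODES = set("FIU")
# A well name such as '[[A1]]' at the start of SAMP_name.
WELL_PATTERN = re.compile(r"^\[\[([A-Za-z]+\d+)\]\]")


def read_study_info(handle):
    """Return the rows as dicts keyed by column name (assumes tab-delimited)."""
    return list(csv.DictReader(handle, delimiter="\t"))


def is_sequenced(row):
    """True when the row carries a real run and lane instead of NULL."""
    return row["RULA_Run"] != "NULL" and row["RULA_Lane"] != "NULL"


def has_failure(row):
    """True when any position of RULA_Status holds a failure code."""
    return any(code in FAILED_CODES for code in row["RULA_Status"])


def well_name(row):
    """Return the well (e.g. 'A1') embedded in SAMP_name, or None."""
    match = WELL_PATTERN.match(row["SAMP_name"])
    return match.group(1) if match else None


# Invented rows, for illustration only.
SAMPLE = (
    "SAMP_id\tSAMP_name\tRULA_Run\tRULA_Lane\tRULA_Barcode\tRULA_Status\n"
    "101\t[[A1]] liver rep1\tFGC1234\t1\tAGGCAGAA\t++++\n"
    "102\tliver RNA\tNULL\tNULL\t\t\n"
    "103\t[[A2]] liver rep2\tFGC1234\t2\tTCCTGAGC\tF+++\n"
)

rows = read_study_info(io.StringIO(SAMPLE))
for row in rows:
    if is_sequenced(row):
        print(row["SAMP_id"], row["RULA_Run"], row["RULA_Lane"],
              row["RULA_Barcode"], "FAILED" if has_failure(row) else "ok")
```

To read a real file, pass an open file handle (for example open("AAA-StudyInfo.txt", newline="")) to read_study_info instead of the in-memory sample.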

7. The Manual Folder

This folder contains the notes, commands, and extra data used to process the data for the investigation. We maintain these folders on a separate server in the UPenn Box system for easy sharing, editing, and version control. We then sync them into the website area when there are changes.

8. The basic Folder

The basic folder is no longer considered a useful folder. Data in here is automatically deleted shortly after it is generated.

The basic folder contains files that hold the basic 'raw' sequence data, the results of any trimming of adapter and/or low-quality sequence, and the alignments to one or more genomes. It also contains the visualization data that appears in the TessLA browser for this data, e.g., the USHP tracks. The contents of this directory have evolved over the time the NGSC has been open, so older experiments may have a slightly different set of files than newer ones.

Practice Prior to 2017-12-15

Fastq - these are the 'raw' data from the sequencer.
- Raw is in quotes as the actual output of the sequencer is in BCL format which we convert to FASTQ.
- The files are usually (but not always) compressed with gzip.
- FASTQ files will be uncompressed during the initial analysis then compressed again later on.
- The files are named by the RUN, LANE, END (for paired-end sequencing) and BARCODE of the data. For example, FGC0503_s_1_1_AGGCAGAA.fastq.gz is the data for run FGC0503, lane 1, end 1, and barcode AGGCAGAA.
- You may occasionally see Undetermined or OTHER instead of a barcode. This happens when we either do not know the barcodes or there is a discrepancy between the reported barcodes and what we see in the pool. When OTHER is present, we have probably split the Undetermined file using the observed barcodes.
fastqc - we often run the FastQC utility to assess the basic statistics of the sequence generated and to look for adapter sequence, base qualities, GC bias, repeated sequence, etc. The output of this program consists of an HTML report plus other pictures and text files.
Bowtie - output of the Bowtie aligner, including alignments as well as logs and statistics of the alignments.
BedFiles - alignments converted to BED or other similar formats for visualization in the TessLA browser.
BioaResults - contains the web pages to display BioA results.
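
The FASTQ naming scheme described above (RUN, LANE, END, BARCODE) can be taken apart with a regular expression, which is handy when matching files back to rows of the AAA-StudyInfo file. The sketch below is based only on the single example name given above. It assumes that the END field may be absent for single-end data and that the barcode field contains no underscore or period; real file names may differ.

```python
import re

# RUN_s_LANE_END_BARCODE.fastq(.gz); END is treated as optional (assumption).
FASTQ_NAME = re.compile(
    r"^(?P<run>[^_]+)_s_(?P<lane>\d+)_(?:(?P<end>\d)_)?"
    r"(?P<barcode>[^._]+)\.fastq(?:\.gz)?$"
)


def parse_fastq_name(filename):
    """Split a FASTQ file name into run, lane, end and barcode."""
    match = FASTQ_NAME.match(filename)
    if match is None:
        raise ValueError(f"unrecognized FASTQ name: {filename}")
    end = match.group("end")
    return {
        "run": match.group("run"),
        "lane": int(match.group("lane")),
        "end": int(end) if end else None,
        # May also be 'Undetermined' or 'OTHER' instead of a barcode.
        "barcode": match.group("barcode"),
    }


print(parse_fastq_name("FGC0503_s_1_1_AGGCAGAA.fastq.gz"))
```

The run, lane, and barcode values correspond to the RULA_Run, RULA_Lane, and RULA_Barcode columns of the AAA-StudyInfo files.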

Previous Practice

In the early years, 2009 to 2011 or so, Illumina's software did not produce FASTQ files and their alignment program, ELAND, did not produce BED or SAM/BAM files. Older investigations will therefore have a different set of directories and file types. The sequence files can be converted, and the alignments can be converted or, better yet, re-aligned to produce data in current formats.

Solexa - this folder contains the sequencing data (sequence.txt) and alignment data (export.txt). Sequence.txt files can be converted to FASTQ for data submission or updating analyses.
Config - contains metadata about sequencing and samples. It can generally be ignored.
Export - contains processed alignment data.

9. The Analysis Folder

The Analysis folder contains all of the secondary analyses for an investigation. It is evolving as we add or alter pipelines.

HOMER

Files related to ChIP-Seq peak-calling using the HOMER tool.

RUM

Files related to RNA-Seq alignment and quantification using the RUM package. Folders generally correspond to individual run/lane/barcodes, perhaps with different alignment protocols.

bs_seeker

Files related to BIS-Seq alignment and quantification. Folders generally correspond to individual run/lane/barcodes, perhaps with different alignment protocols. There will also be folders that correspond to samples, because whole-genome sequencing requires a large number of lanes for good coverage.

ReadRedundancy

Statistics about the level of read duplication in samples.

TargetGenes

Files that connect genome features, usually ChIP-Seq peaks, to nearby genes.

DiffExp

Files related to differential-expression analysis of RNA-Seq experiments. Each folder corresponds to one or more comparisons with a particular set of data and/or analysis parameters. Often there will be a folder with a preliminary analysis, and another with bad samples left out and/or a specific set of comparisons.