Accessing Data via Website

Table Of Contents

  1. Introduction
  2. Access via PMACS HPC
  3. Bulk Downloads
  4. The Download Data Button
  5. Where's My Investigation?
  6. AAA-StudyInfo
  7. The Manual Folder
  8. The basic Folder
  9. The Analysis Folder

1.  Introduction

The Basics

There is a 'Download Data' link under the 'Results' menu on the website. It leads to the download area for your PI's lab. The site is password protected; log in with your NGSC username and password. All files that the NGSC produces in the course of doing your experiment are available here.

Data is accessed over HTTPS (i.e., with any web browser), but can easily be downloaded in bulk using command line utilities such as wget or curl. Many GUI programs can also download over HTTP. See the sections on the individual folders below for how the files are organized.

Data Sharing

Our policy is to make all data for a PI available to all of her lab members.

Collaborators can get access to all of a PI's data by setting up an account as a member of the PI's lab (with the PI's permission, of course!).

If a collaborator should only have access to a limited set of investigations, then the collaborating PI and relevant lab members should set up their own standard PI and investigator accounts. We will then connect them to the appropriate investigations.

2.  Access via PMACS HPC

If you have a PMACS HPC account, you can pick up the data on the HPC. The website presents exactly the same files that are on the HPC. To get the path of a file on the HPC, replace the text https://ngsc.med.upenn.edu/Experiments in its URL with /project/ngsc_data/PI_INVESTIGATIONS.

For example
https://ngsc.med.upenn.edu/Experiments/PI/INVESTIGATION/AAA-ForPickUp/FASTQ/
is accessed as
/project/ngsc_data/PI_INVESTIGATIONS/PI/INVESTIGATION/AAA-ForPickUp/FASTQ/
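
The substitution can also be scripted. The following is a minimal sketch in POSIX shell, not an NGSC-provided tool: the function name ngsc_url_to_hpc_path is made up for illustration, and only the two prefixes come from the text above.

```shell
# Hypothetical helper: convert an NGSC download URL into the matching HPC path.
ngsc_url_to_hpc_path() {
    ngsc_url_prefix='https://ngsc.med.upenn.edu/Experiments'
    ngsc_hpc_prefix='/project/ngsc_data/PI_INVESTIGATIONS'
    case "$1" in
        "$ngsc_url_prefix"/*)
            # Drop the URL prefix and put the HPC prefix in its place.
            printf '%s\n' "${ngsc_hpc_prefix}${1#"$ngsc_url_prefix"}"
            ;;
        *)
            echo "not an NGSC Experiments URL: $1" >&2
            return 1
            ;;
    esac
}
```

Calling it with the example URL above prints the example HPC path; any other URL gives an error message and a non-zero exit status.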

Important Note - if you are having the NGSC store files and charge your PMACS HPC account, then our copy of the files will have a group name like ngsc_PINAME_lab. When you get your copy of the files, it is important to use cp to make a copy and then rm our original file. That way, your copy of the file will have your standard group name, which will simplify accounting for you later on. If you use mv instead, the file will retain the ngsc_PINAME_lab group even though it is under your directory. You will still be charged, but it may be less clear where the charges originate.
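
A minimal sketch of the copy-then-remove step, assuming a POSIX shell on the HPC. The function name copy_then_remove is invented for this example, and the sketch assumes the destination directory does not itself force a group on new files (a setgid directory would).

```shell
# Hypothetical helper: take over a file or folder by copying it, then
# deleting the NGSC original only if the copy succeeded.
# $1 = NGSC original (group like ngsc_PINAME_lab), $2 = destination in your area
copy_then_remove() {
    # cp creates new files, which normally get your own default group;
    # the && ensures rm never runs after a failed copy.
    cp -R "$1" "$2" && rm -r "$1"
}
```

Afterwards, ls -l on the destination shows which group the copy ended up with.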

3.  Bulk Downloads

At the moment we do not support ftp or Aspera connections for acquiring data.

Via HTTPS

If you have access to a command line, e.g., in Linux or Mac OS X, you can use the wget command as follows to download, for example, all of your FASTQ files. Replace USER and PASSWORD with your username and password, and PI and INVESTIGATION with your PI and the investigation of interest.

wget --user=USER --password=PASSWORD -r -np \
https://ngsc.med.upenn.edu/Experiments/PI/INVESTIGATION/AAA-ForPickUp/FASTQ/

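The -r -np options are very important and should not be left out: -r makes wget recurse through the folder, and -np (no parent) keeps it from climbing into the parent directories and downloading everything above the FASTQ folder.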
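
Passing --password on the command line leaves the password in your shell history and visible in the process list. As a hedged alternative, GNU wget also accepts --ask-password, which prompts for the password instead. The sketch below wraps that in two illustrative functions; the names ngsc_fastq_url and ngsc_fetch_fastq are assumptions, not NGSC tools, and the URL layout is the one shown above.

```shell
# Build the FASTQ download URL for a PI ($1) and investigation ($2).
ngsc_fastq_url() {
    printf 'https://ngsc.med.upenn.edu/Experiments/%s/%s/AAA-ForPickUp/FASTQ/\n' "$1" "$2"
}

# Recursively download the FASTQ folder, prompting for the password.
# $1 = NGSC username, $2 = PI, $3 = investigation
ngsc_fetch_fastq() {
    # -r: recurse into the folder; -np: never ascend to parent directories
    wget --user="$1" --ask-password -r -np "$(ngsc_fastq_url "$2" "$3")"
}
```

For example, ngsc_fetch_fastq USER PI INVESTIGATION downloads the same folder as the command above.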

Via SSH

If you have an account on the PMACS HPC, you can copy the data from your investigation folder into your own area on PMACS, or use scp or rsync to bring it back to your personal or lab machine. Our folders are laid out on the HPC exactly as they appear on the web.
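
As a rough sketch of the rsync route, run from your personal or lab machine: the HPC username and login host are placeholders you must supply (the host name is not given here), and the function names are invented for this example. The remote path follows the layout described in section 2.

```shell
# Build the rsync source for an investigation folder on the HPC.
# $1 = HPC username, $2 = HPC login host (placeholder), $3 = PI, $4 = investigation
ngsc_rsync_source() {
    printf '%s@%s:/project/ngsc_data/PI_INVESTIGATIONS/%s/%s/\n' "$1" "$2" "$3" "$4"
}

# Copy the investigation into a local folder ($5) over SSH.
ngsc_pull_investigation() {
    # -a: recurse and preserve timestamps/permissions; -v: list files as they transfer
    rsync -av "$(ngsc_rsync_source "$1" "$2" "$3" "$4")" "$5"
}
```

If you use rsync between folders on the HPC itself, bear in mind that -a tries to preserve the group where it can, so the copy may keep the ngsc_PINAME_lab group mentioned in section 2; adding --no-g, or using plain cp, avoids this.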

You can contact the PMACS/HPC folks here.

Pushing Data

Upon request the NGSC can push your data to your machine. You will need to provide us with the hostname and account information. We will then scp or rsync your data to that machine.

Via Illumina's BaseSpace

Runs done in February and March are not available on BaseSpace.

We use Illumina's cloud computing platform BaseSpace to monitor our runs. The full data for MiSeq runs is uploaded to BaseSpace and we can share the run with you if you have a BaseSpace account.

Information about BaseSpace can be found here.

4.  The Download Data Button

The Download Data button on the website leads to web pages that provide direct access to all of the files for your PI. You will have to log in again using your normal NGSC account username and password. Inside this directory will be one or more folders that correspond to the investigations your PI has with us. Inside each investigation folder is another set of files and folders, which are fairly standardized.

5.  Where's My Investigation?

There are two main reasons why an investigation may not appear when you click the Download Data button: either it is an old investigation, or it is part of a collaboration with another PI.

If the investigation is old, try making the following change to the URL in your browser. In a URL such as `http://fgc.genomics.upenn.edu/Experiments-2/PI`, replace `Experiments-2` with `Experiments-1`. Also bear in mind that some very old experiments are named using the invoice number, PI, and investigator, so you may have to pick through a few folders to find the one you want.

We have moved to a new webserver, which should have both new and old investigations under the same path, Experiments. However, not all old data has been made visible (to reduce our disk footprint). If data is missing, please contact us to determine why.

If you are a collaborator and need to find investigations from other PIs, try replacing your PI's name in the URL with the name of the PI of the investigation.

If neither of these works, please feel free to contact us.

6.  AAA-StudyInfo

The AAA-StudyInfo files (in text or Excel format) contain a summary of the samples submitted for an investigation and any sequencing done for the samples.

Each sample is listed at least once. Since a sample may be sequenced in multiple lanes, a sample may be listed more than once, but each entry will have a different run or lane.

FASTQ and many other files are named by the run, lane, and barcode. This allows us to generate unambiguous names that can easily be linked back to a sample and other run metadata.

Here are the columns in the AAA-StudyInfo files.

7.  The Manual Folder

This folder contains the notes, commands, and extra data used to process the data for the investigation. We maintain these folders on a separate server in the UPenn Box system for easy sharing, editing, and version control. We then sync them into the website area when there are changes.

8.  The basic Folder

The basic folder is no longer considered a useful folder. Data in here is automatically deleted shortly after it is generated.

The basic folder contains files that hold the basic 'raw' sequence data, any trimming of adapter and/or low quality sequence, as well as the alignments to one or more genomes. It also contains the visualization data that appears in the TessLA browser for this data, e.g., the USHP tracks. The contents of this directory have evolved over the time the NGSC has been open, so older experiments may have a slightly different set of files than newer ones.

Practice Prior to 2017-12-15

Previous Practice

In the early years, 2009 to 2011 or so, Illumina's software did not produce FASTQ files, and its alignment program, ELAND, did not produce BED or SAM/BAM files. Older investigations will therefore have a different set of directories and file types. The sequence files can be converted, and the alignments can be converted or, better yet, re-aligned to produce data in current formats.

9.  The Analysis Folder

The Analysis folder contains all of the secondary analyses for an investigation. It is evolving as we add or alter pipelines.

HOMER

Files related to ChIP-Seq peak-calling using the HOMER tool.

RUM

Files related to RNA-Seq alignment and quantification using the RUM package. Folders generally correspond to individual run/lane/barcodes, perhaps with different alignment protocols.

bs_seeker

Files related to BIS-Seq alignment and quantification. Folders generally correspond to individual run/lane/barcodes, perhaps with different alignment protocols. There will also be folders that correspond to samples, since whole-genome sequencing requires a large number of lanes for good coverage.

ReadRedundancy

Statistics about the level of read duplication in samples.

TargetGenes

Files that connect genome features, usually ChIP-Seq peaks, to nearby genes.

DiffExp

Files related to differential-expression analysis of RNA-Seq experiments. Each folder corresponds to one or more comparisons with a particular set of data and/or analysis parameters. Often there will be a folder with a preliminary analysis, and another with bad samples left out and/or a specific set of comparisons.