NGSpeAnalysis Pipeline


Running NGSpeAnalysis Pipeline

This section describes how to use the pipeline to map FASTQ DNA sequencing data to the reference genome (hg19/build37), call SNP and INDEL genotypes from the mapped reads, and generate annotation results.

Input Files

The NGSpeAnalysis pipeline requires three inputs:

1. Fastq File (download sample fastq files from here. You may also use your own fastq files as input.)

2. GATK resources (ftp://ftp.broadinstitute.org/pub/gsa/gatk_resources.tgz)

3. 00-All.vcf (ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz)

Necessary software packages

BWA (http://sourceforge.net/projects/bio-bwa/files/)
BEDTools (http://code.google.com/p/bedtools/downloads/detail?name=BEDTools.v2.12.0.tar.gz)
GATK (ftp://ftp.broadinstitute.org/pub/gsa/GenomeAnalysisTK/GenomeAnalysisTK-latest.tar.bz2)
Picard (http://sourceforge.net/projects/picard/files/)
ANNOVAR (http://www.openbioinformatics.org/annovar/annovar_download.html)
The reference resources and software packages can also be downloaded and installed by running install.sh, except for GATK and ANNOVAR, whose authors require registration before download.
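
Before running the pipeline it can help to confirm that the command-line tools are reachable. The sketch below is illustrative only: check_commands is a hypothetical helper, not part of install.sh, and it assumes the tools are on your PATH (GATK and Picard are Java programs and ANNOVAR is written in Perl, so java and perl are checked rather than the packages themselves).

```shell
# Report which of the named commands cannot be found on PATH.
# Prints one "missing: NAME" line per absent command and
# returns 1 if at least one command is missing, 0 otherwise.
check_commands() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd"
            missing=1
        fi
    done
    return $missing
}

# Example call (assumes bwa was added to PATH after installation):
# check_commands bwa java perl
```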

Hardware on Which We Ran the Script

Operating system: Red Hat Enterprise Linux (release 6.0)
CPU: Intel Xeon W3550 3.06 8MB/1066 QC
Memory: 8GB

Steps to run NGSpeAnalysis on Linux PC

1. Install BWA, GATK, Picard and ANNOVAR (BEDTools is optional) on your computer according to their instructions. For example, we installed these packages in ~/Downloads/ (~ means your home directory). You may ask your IT helpdesk for help with the installation.

2. Download the resource files listed in the Input Files section from the links given there. Unzip them and copy them to your resource folder. For example, we downloaded them to ~/Downloads/inputfile.

$cd ~/Downloads/inputfile

$gunzip 00-All.vcf.gz

$tar zxvf gatk_resources.tgz

$mkdir ~/Downloads/resources

$cp -r ./* ../resources

3. Open the downloaded script in a text editor (gedit, wordpad or notepad++). Replace the file paths of the input files and software packages with the correct ones for your computer. For example, replace PATH_TO/resources with ~/Downloads/resources (all input file paths in our scripts start with "PATH_TO", and all software package paths in our scripts start with "~/Downloads/").

4. Run the script from the command line as follows (replace "~/Downloads" with the path where you saved the shell script).
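
The path replacement in step 3 can also be done non-interactively. The following is a minimal sketch using sed, not something shipped with the pipeline: set_pipeline_paths is a hypothetical helper name, and it assumes the target directory contains no "|" or "&" characters, which sed would treat specially.

```shell
# Replace every "PATH_TO/resources" in a pipeline script with a real directory.
# $1 = script to edit, $2 = directory that replaces PATH_TO/resources
# The original script is kept next to it with a .bak suffix.
set_pipeline_paths() {
    script=$1
    target=$2
    cp "$script" "$script.bak"
    # "|" is the sed delimiter here, so slashes in the path need no escaping.
    sed "s|PATH_TO/resources|$target|g" "$script.bak" > "$script"
}

# Example call; $HOME is used instead of "~" so the script gets an absolute path:
# set_pipeline_paths ~/Downloads/ngsPipe_for_LinuxPC.sh "$HOME/Downloads/resources"
```

Comparing the edited script with its .bak copy (for example with diff) shows exactly which lines changed.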

$sh ~/Downloads/ngsPipe_for_LinuxPC.sh

After about 15 hours (on a Linux PC) you will get your SNP genotype calls (file suffix "snps.filtered.vcf") and INDEL genotype calls (file suffix "indels.filtered.vcf").

5. Process multi-sample genotype calls on a master list and annotate the variations.

Delete the leading "##" from each step that follows the "Generate statistics on called SNPs" step in the script.
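
A quick sanity check on the output is to count the variant records in each file. In the VCF format, header lines start with "#" and every other line is one variant record, so the sketch below simply counts non-header lines. count_vcf_records is a hypothetical helper name and is not part of the pipeline.

```shell
# Print "FILE: N" for each VCF given, where N is the number of
# variant records (lines that do not start with "#").
count_vcf_records() {
    for vcf in "$@"; do
        # grep -c exits non-zero when the count is 0; "|| true" keeps going.
        n=$(grep -vc '^#' "$vcf" || true)
        echo "$vcf: $n"
    done
}

# Example call, run in the pipeline's output directory:
# count_vcf_records *snps.filtered.vcf *indels.filtered.vcf
```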

a. Generate the master list (detailed instructions).

b. Annotate using ANNOVAR.

Convert the file from VCF format to ANNOVAR input format (How to).
Annotate variations (How to).
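
The two ANNOVAR steps can be sketched as a small shell function. This is an illustration under assumptions, not the exact commands in our script: annotate_vcf is a hypothetical helper name, the ANNOVAR scripts are assumed to be executable, and the hg19 gene annotation database is assumed to be present in ANNOVAR's humandb/ folder.

```shell
# Convert a VCF file to ANNOVAR input format, then run gene-based annotation.
# $1 = ANNOVAR install directory, $2 = input VCF, $3 = output prefix
annotate_vcf() {
    annovar_dir=$1
    vcf=$2
    prefix=$3
    # Step 1: VCF -> ANNOVAR input format (written to <prefix>.avinput).
    "$annovar_dir/convert2annovar.pl" -format vcf4 "$vcf" > "$prefix.avinput" || return 1
    # Step 2: gene-based annotation against the hg19 build.
    "$annovar_dir/annotate_variation.pl" -geneanno -buildver hg19 \
        "$prefix.avinput" "$annovar_dir/humandb/"
}

# Example call (the sample file name is a placeholder):
# annotate_vcf ~/Downloads/annovar sample.snps.filtered.vcf sample.snps
```

With gene-based annotation, ANNOVAR normally writes its results next to the .avinput file, in files ending in .variant_function and .exonic_variant_function.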

Steps to run NGSpeAnalysis on HPC

1. Install BWA, GATK, Picard and ANNOVAR (BEDTools is optional) on your computer according to their instructions. For example, we installed these packages in ~/Downloads/ (~ means your home directory). A Portable Batch System (PBS) environment is expected for scheduling the jobs. You may ask your IT helpdesk for help with the installation.

2. Download the resource files listed in the Input Files section from the links given there. Unzip them and copy them to your resource folder. For example, we downloaded them to ~/Downloads/inputfile.

$cd ~/Downloads/inputfile

$gunzip 00-All.vcf.gz

$tar zxvf gatk_resources.tgz

$mkdir ~/Downloads/resources

$cp -r ./* ../resources

3. Open the downloaded script in a text editor (gedit, wordpad or notepad++). Replace the file paths of the input files and software packages with the corresponding locations on your computer. For example, replace PATH_TO/resources in the PBS script with the path to your resource folder, which in the above case is ~/Downloads/resources. Change the paths to the software packages in the same way.

4. Run the script from the directory where the file is located.

$qsub ngsPipe_for_HPC

To check the status of the job, use:

$qstat

The job takes approximately 20 hours to run to completion. Our HPC environment is based on a CentOS platform. For each exome or genome run, 8 processor nodes were used (8X Intel Xeon E53XX 1.8-3.0 8MB/1066 QC; Memory: 8X 2GB). The memory and processors are tightly coupled. Along with the various intermediate files, the SNP calls and INDEL calls are written to the file path specified in Step 3.

Note: Multiple copies of the script can be made to run multiple samples in parallel.
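
Making the per-sample copies mentioned in the note can be scripted. The sketch below only creates the copies; copy_script_per_sample is a hypothetical helper name, and the naming scheme <script>.<sample> is an assumption made for illustration. Each copy still has to be edited so that its input and output paths point at the right sample.

```shell
# Create one copy of a PBS script per sample name.
# $1 = template script; remaining arguments = sample names.
# Each copy is named <template>.<sample>, and its path is printed.
copy_script_per_sample() {
    template=$1
    shift
    for sample in "$@"; do
        cp "$template" "$template.$sample"
        echo "$template.$sample"
    done
}

# Example call (sample names are placeholders):
# copy_script_per_sample ngsPipe_for_HPC sampleA sampleB
# After editing the paths in each copy, submit them one by one:
# for job in ngsPipe_for_HPC.*; do qsub "$job"; done
```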

5. Process multi-sample genotype calls on a master list and annotate the variations (same as step 5 of the Linux PC section).
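
Step 5 depends on deleting the leading "##" from the later steps of the script. A minimal sketch of doing that with awk follows. It assumes that the marker text appears in the script and that every line to be enabled begins with exactly "##"; neither assumption has been checked against the actual script, so review the result before running it. uncomment_after_marker is a hypothetical helper name.

```shell
# Strip a leading "##" from every line that follows the first marker line.
# $1 = script path, $2 = marker text; the result is written to stdout,
# so the original script is left untouched.
uncomment_after_marker() {
    awk -v marker="$2" '
        found { sub(/^##/, "") }          # past the marker: drop a leading "##"
        index($0, marker) { found = 1 }   # the marker line itself is not changed
        { print }
    ' "$1"
}

# Example call (the output file name is a placeholder):
# uncomment_after_marker ngsPipe_for_HPC "Generate statistics on called SNPs" > ngsPipe_for_HPC.step5
```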
