BWA, GATK and snpEff + GE (human (hs37d5) only, PE)
This pipeline first aligns short NGS reads to a reference genome with BWA and displays the alignments in our web-based genome browser, Genome Explorer (GE). It then calls genomic variants with GATK. Finally, the called variants, including SNVs and short INDELs, are annotated with snpEff, which predicts the impact of amino acid changes. You can also add annotations, including SNP frequencies compiled by this project from public archives. This pipeline supports human data only.
|Execution time||A few days (30M paired-end reads, 76 bp + 76 bp)|
Raw NGS reads (FASTQ format, paired-end): 1-10. See the dataset type explanation: fastq paired-end
|1．Lists of SNV and INDEL (VCF file)||2．Statistics of SNV and INDEL|
|3．Inter-individual similarity analysis (clustering)||Inter-individual similarity analysis (principal component analysis)|
|4．Visualization with Genome Explorer||5．Tab-delimited file for easy filtering|
How to run this pipeline
See also the general MASER operating instructions (for uploading, for executing a pipeline).
1．When you upload an input file (here, a FASTQ file), a screen appears for entering the required information, such as the “Data Label” field, where you can enter any label you like. In the optional information section, enter an appropriate name in the “Sample Name” field and choose “WGS” in the “Application Type” field, even if the data are actually exome data.
2．This pipeline is categorized under “Resequencing.” Click the “Analysis” button to specify the input files.
3．If you did not enter a sample name in step 1, you can enter it directly on the option page. If you entered a sample name when uploading, leave the “Sample Name” field set to “DEFAULT” so that the uploaded sample name is used. If the “Sample Name” field was left empty when uploading, “DEFAULT” in this step causes the file name to be used as the sample name.
4．To call variants only in a limited region, set the “-L” option.
Example 1: to restrict the analysis to the X chromosome, enter “X” in this field.
Example 2: to examine the region of chromosome 1 between positions 10000 and 20000, enter “1:10000-20000” in this field.
5．Set the other GATK parameters if necessary.
Choose the database of known SNVs and INDELs. The default is the database compiled by Cell Innovation.
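The two “-L” examples in step 4 follow a simple pattern: either a chromosome name alone, or a chromosome name followed by a start and end position. The sketch below checks a value against those two forms before you paste it into the field. The function name `parse_region` is hypothetical and is not part of MASER or GATK, and the pipeline's own validation may differ. It assumes 1-based coordinates with the end position included, and chromosome names without a “chr” prefix, as in the examples.

```python
import re

# Hypothetical helper: accepts only the two forms shown in the examples,
# "X" (whole chromosome) and "1:10000-20000" (chromosome:start-end).
_REGION = re.compile(r"^(?P<chrom>[^:\s]+)(?::(?P<start>\d+)-(?P<end>\d+))?$")


def parse_region(text):
    """Return (chrom, start, end); start and end are None for a whole chromosome."""
    match = _REGION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized region: {text!r}")
    chrom = match.group("chrom")
    if match.group("start") is None:
        return chrom, None, None
    start, end = int(match.group("start")), int(match.group("end"))
    # Assumption: coordinates are 1-based and start must not exceed end.
    if start < 1 or end < start:
        raise ValueError(f"invalid coordinates in region: {text!r}")
    return chrom, start, end


whole_chromosome = parse_region("X")
sub_region = parse_region("1:10000-20000")
```

Here `whole_chromosome` is `("X", None, None)` and `sub_region` is `("1", 10000, 20000)`; a reversed range such as “1:20000-10000” raises `ValueError`.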
1．The following explains the output files, which are shown as icons in the figure below.
2．Open file 2 (circled in red in the figure in step 1) to see the SNV and INDEL statistics produced by snpEff.
3．File 3 (circled in red in the figure in step 1) contains clustering and principal component analysis plots, based on the presence or absence of SNVs, for inferring relatedness between samples. To quantify SNVs, a homozygous substitution is counted as 2, a heterozygous substitution as 1, and no substitution as 0. Clustering uses the group average method.
4．The files labeled 1 (circled in red in the figure in step 1) list SNV and INDEL information in VCF format. From the top, they are SNV.vcf, INDEL.vcf, SNV+INDEL.vcf, SNV.important.vcf (SNVs that pass the false-positive filters and occur near protein-coding or splice sites), and INDEL.important.vcf (INDELs that pass the false-positive filters and occur near protein-coding or splice sites).
5．To obtain the results and filter the variants, download file 5 (circled in red in the figure in step 1). Then open it in Excel and use the filter function.
6．With Genome Explorer, you can inspect the actual variants that passed your filtering. Click file 4 (circled in red in the figure in step 1) to open it in Genome Explorer. Click the “Zoom in” button repeatedly until the display shows 1 base/px. Then click the “Search” tab, choose the chromosome, and enter the position to check. The view then moves to the variant site.
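The 0/1/2 scoring described in step 3 can be illustrated with a short sketch. It is not the pipeline's own code: the sample names and genotypes are made up, the Manhattan distance is an assumption (the distance measure used by the pipeline is not stated here), and missing calls are treated as 0 for simplicity. The last function shows what "group average" means: the distance between two clusters is the mean of all pairwise distances between their members.

```python
def encode_genotype(gt):
    """Count non-reference alleles in a VCF GT string such as "0/1".

    Assumption: missing alleles (".") count as 0, and a call with two
    different alternate alleles (e.g. "1/2") is scored 2.
    """
    alleles = gt.replace("|", "/").split("/")
    return sum(1 for allele in alleles if allele not in ("0", "."))


def manhattan(a, b):
    """Sum of absolute differences between two encoded genotype vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Made-up genotypes at four SNV sites for three hypothetical samples.
samples = {
    "sampleA": ["0/0", "0/1", "1/1", "0/1"],
    "sampleB": ["0/0", "0/1", "1/1", "0/0"],
    "sampleC": ["1/1", "0/0", "0/1", "1/1"],
}
encoded = {name: [encode_genotype(g) for g in gts] for name, gts in samples.items()}


def group_average_distance(cluster_a, cluster_b):
    """Mean of all pairwise distances between members of two clusters."""
    pairs = [(a, b) for a in cluster_a for b in cluster_b]
    return sum(manhattan(encoded[a], encoded[b]) for a, b in pairs) / len(pairs)


merged_vs_c = group_average_distance(["sampleA", "sampleB"], ["sampleC"])
```

With these values, sampleA and sampleB differ at one site only, so a group-average clustering would merge them first and then compare the merged pair with sampleC.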
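As an alternative to Excel, the VCF files from step 4 can also be inspected with a few lines of script. The sketch below reads the eight fixed VCF columns and keeps records whose FILTER column is “PASS”. It assumes that the pipeline marks filter results in the FILTER column, which is the usual VCF convention but is not confirmed here; the records shown are made up, and `passing_records` is a hypothetical name.

```python
import io


def passing_records(handle):
    """Yield (chrom, pos, ref, alt) for records whose FILTER column is PASS."""
    for line in handle:
        # Skip meta-information lines, the column header, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt, _qual, filt = fields[:7]
        if filt == "PASS":
            yield chrom, int(pos), ref, alt


# Made-up records; with a downloaded file, use open("SNV.vcf") instead.
example_vcf = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t10177\t.\tA\tC\t50.0\tPASS\t.\n"
    "1\t10352\t.\tT\tG\t12.3\tLowQual\t.\n"
    "X\t20000\t.\tG\tA\t99.0\tPASS\t.\n"
)
kept = list(passing_records(example_vcf))
```

Only the first and third records are kept, because the second carries a non-PASS filter value.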
|SAMtools||(Original)||(NGS Surfer's wiki)|
|snpEff||(Original)||(NGS Surfer's wiki)|
|BWA||(Original)||(NGS Surfer's wiki)|