Platform for Drug Discovery


BWA, GATK and snpEff + GE (human (hs37d5) only, PE)


Introduction


    This pipeline first aligns short NGS reads to a reference genome with BWA and visualizes the alignments in our web-based genome browser, GenomeExplorer (GE). It then calls genomic variants with GATK. Finally, the called variants, including SNVs and short InDels, are annotated with snpEff, which predicts the impact level of amino acid changes. You can also add annotations, including SNP frequencies compiled by this project from public archives. This pipeline supports human data only.


    Input format: FASTQ
    Library layout: Paired-end
    Species: Human
    Execution time: A few days (30M paired-end reads, 76 bp + 76 bp)



Inputs


    Raw NGS reads (FASTQ format, Paired-end): 1-10

    ・Check the dataset type explanation: fastq paired-end

Outputs


    Workflow
    Image
    1. Lists of SNV and INDEL (VCF file)
    Image
    Extraction of variation using GATK UnifiedGenotyper
    2. Statistics of SNV and INDEL
    Image
    Statistics obtained with snpEff
    3. Inter-individual similarity analysis (clustering)
    Image
    Clustering based on values quantified from variation information
    Inter-individual similarity analysis (principal component analysis)
    Image
    Principal component analysis based on values quantified from variation information
    4. Visualization with Genome Explorer
    Image
    Confirmation of variation region
    5. Tab-delimited file that makes the information easy to filter
    Image
    Lists of SNV and INDEL


How to run this pipeline


    See also the general MASER operating instructions (for uploading, for pipeline execution).

    1. When you upload an input (here, a FASTQ file), a screen asks for the required information, such as the “Data Label” field, where you can enter any label you like. In the optional information section, enter an appropriate name in the “Sample Name” field and choose “WGS” in the “Application Type” field, even if the data are actually exome.

    Image

    2. This pipeline is categorized under “Resequencing.” Click the “Analysis” button to specify the input files.

    Image

    3. If you did not enter a sample name in step 1, you can enter it directly on the option page. If you did enter a sample name during upload, leave the “Sample Name” field set to “DEFAULT” so that the uploaded sample name is used. If the “Sample Name” field was left empty during upload, “DEFAULT” in this step causes the file name to be used as the sample name.

    Image
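
    The sample-name rules in step 3 can be summarized as a small decision function. This is only a sketch of the behavior described above, not MASER code: the function name and its arguments are hypothetical, and it assumes that a name typed on the option page takes precedence over one entered during upload, which the text does not state explicitly.

```python
DEFAULT = "DEFAULT"  # value of the "Sample Name" option when left untouched


def resolve_sample_name(option_value: str, uploaded_name: str, file_name: str) -> str:
    """Return the sample name that would be used under the rules in step 3.

    option_value  -- content of the "Sample Name" field on the option page
    uploaded_name -- "Sample Name" entered during upload (may be empty)
    file_name     -- name of the uploaded FASTQ file
    """
    option_value = option_value.strip()
    uploaded_name = uploaded_name.strip()

    # Assumption: an explicit name on the option page overrides everything else.
    if option_value and option_value != DEFAULT:
        return option_value
    # "DEFAULT" with a name entered during upload: use the uploaded name.
    if uploaded_name:
        return uploaded_name
    # "DEFAULT" with no uploaded name: fall back to the file name.
    return file_name
```

    For example, `resolve_sample_name("DEFAULT", "", "reads_1.fastq")` falls through to the file name, while the same call with a non-empty uploaded name returns that name.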

    4. To call variants only in a limited region, set the “-L” option.
    Example 1: to restrict the analysis to the X chromosome, enter “X” in this field.
    Example 2: to examine the region of chromosome 1 between positions 10000 and 20000, enter “1:10000-20000” in this field.

    Image
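
    Before submitting a job, it can help to check that a “-L” value has one of the two forms shown in step 4. The following sketch validates only those two forms (a bare chromosome name, or chromosome:start-end); GATK itself accepts further interval notations that are not handled here. The `Interval` type and `parse_interval` function are hypothetical helpers, not part of MASER or GATK, and the check assumes 1-based, inclusive coordinates.

```python
import re
from typing import NamedTuple, Optional


class Interval(NamedTuple):
    chrom: str
    start: Optional[int]  # None means "whole chromosome"
    end: Optional[int]


# Either "X" or "1:10000-20000"; no whitespace inside the value.
_INTERVAL_RE = re.compile(r"^([^:\s]+)(?::(\d+)-(\d+))?$")


def parse_interval(text: str) -> Interval:
    """Parse a -L value of the form 'CHROM' or 'CHROM:START-END'."""
    match = _INTERVAL_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unsupported interval format: {text!r}")
    chrom, start, end = match.groups()
    if start is None:
        return Interval(chrom, None, None)
    start_pos, end_pos = int(start), int(end)
    # Coordinates are treated as 1-based and inclusive.
    if start_pos < 1 or end_pos < start_pos:
        raise ValueError(f"invalid coordinates in interval: {text!r}")
    return Interval(chrom, start_pos, end_pos)
```

    With the examples from step 4, `parse_interval("X")` yields a whole-chromosome interval and `parse_interval("1:10000-20000")` yields chromosome 1 with start 10000 and end 20000. Chromosome names follow the hs37d5 reference, which uses “1” and “X” rather than “chr1” and “chrX”.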

    5. Set the other GATK parameters if necessary.
    Choose the database of known SNVs and INDELs. The default is the database made by Cell Innovation.

    Image

Result explanation


    1. The output files are shown as icons in the figure below and are explained in the following steps.

    Image
    2. Open file 2, circled in red in the figure in step 1, to see the SNV and INDEL statistics produced by snpEff.

    Image
    Statistics made by snpEff

    3. File 3, circled in red in the figure in step 1, contains clustering and principal component analysis plots based on the presence or absence of SNVs, which can be used to infer relatedness between samples. To quantify SNVs, a homozygous substitution is counted as 2, a heterozygous substitution as 1, and no substitution as 0. Clustering uses the group average method.

    Image
    Clustering by quantifying variation information
    Image
    Principal component analysis by quantifying variation information
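
    The quantification and clustering described in step 3 can be illustrated with a small, self-contained sketch. It encodes VCF-style genotype strings as 0, 1, or 2 and then performs naive group average (average linkage) clustering. Several points are assumptions rather than facts about the pipeline: the Euclidean distance between samples, the treatment of missing genotypes as 0, the function names, and the toy genotypes for samples A, B, and C.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def encode_genotype(gt: str) -> int:
    """Encode a diploid genotype: 0 = no substitution, 1 = heterozygous, 2 = homozygous."""
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles:
        return 0  # assumption: a missing call is treated as no substitution
    first, second = alleles
    if first == "0" and second == "0":
        return 0
    return 2 if first == second else 1


def euclidean(u: List[int], v: List[int]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_linkage(vectors: Dict[str, List[int]]) -> List[Tuple[tuple, tuple, float]]:
    """Merge the two closest clusters repeatedly; return the merge history."""
    clusters = [(name,) for name in vectors]
    merges = []
    while len(clusters) > 1:
        best = None
        for c1, c2 in combinations(clusters, 2):
            # Group average: mean distance over all cross-cluster sample pairs.
            total = sum(euclidean(vectors[a], vectors[b]) for a in c1 for b in c2)
            dist = total / (len(c1) * len(c2))
            if best is None or dist < best[0]:
                best = (dist, c1, c2)
        dist, c1, c2 = best
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append((c1, c2, dist))
    return merges


# Toy data: three samples genotyped at three SNV sites (illustrative only).
genotypes = {
    "A": ["0/0", "0/1", "1/1"],
    "B": ["0/0", "0/1", "1/1"],
    "C": ["1/1", "0/0", "0/0"],
}
encoded = {name: [encode_genotype(g) for g in gts] for name, gts in genotypes.items()}
merges = average_linkage(encoded)
```

    In this toy data set, samples A and B have identical genotypes, so they are merged first at distance 0, and C joins last. The naive loop is fine for a handful of samples; a library implementation would be preferable for larger cohorts.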


    4. The files labeled 1, circled in red in the figure in step 1, list SNV and INDEL information in VCF format. From the top, these are SNV.vcf, INDEL.vcf, SNV+INDEL.vcf, SNV.important.vcf (SNVs that pass the false-positive filters and occur near protein-coding or splicing sites), and INDEL.important.vcf (INDELs that pass the false-positive filters and occur near protein-coding or splicing sites).

    Image
    Variation information extracted with GATK UnifiedGenotyper
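
    To get a quick overview of a downloaded VCF file such as SNV+INDEL.vcf, the records can be counted by type with a few lines of code. The sketch below relies only on the standard VCF column order (CHROM, POS, ID, REF, ALT, ...) and uses a simplified length-based rule to separate SNVs from INDELs. The function names and the example records are made up for illustration, and the rule is not necessarily the one the pipeline uses to split its output files.

```python
from collections import Counter
from typing import Iterable

_BASES = set("ACGTN")


def classify_record(ref: str, alt: str) -> str:
    """Classify a record as SNV, INDEL, or OTHER from its REF and ALT columns."""
    alts = alt.split(",")
    # Symbolic or missing alleles (e.g. "<DEL>", ".", "*") are not classified.
    if not set(ref.upper()) <= _BASES or any(not a or not set(a.upper()) <= _BASES for a in alts):
        return "OTHER"
    if len(ref) == 1 and all(len(a) == 1 for a in alts):
        return "SNV"
    if any(len(a) != len(ref) for a in alts):
        return "INDEL"
    return "OTHER"


def count_variants(lines: Iterable[str]) -> Counter:
    """Count variant types in VCF text, skipping header lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        counts[classify_record(fields[3], fields[4])] += 1
    return counts


# Illustrative records only; not real pipeline output.
example_vcf = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t10100\t.\tA\tG\t50\tPASS\t.\n",
    "1\t10200\t.\tAT\tA\t50\tPASS\t.\n",
    "X\t5000\t.\tC\tCTT\t50\tPASS\t.\n",
]
counts = count_variants(example_vcf)
```

    For a real file, pass an open file handle instead of the list, for example `count_variants(open("SNV+INDEL.vcf"))`.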

    5. To filter the variants, download file 5, circled in red in the figure in step 1, open it in Excel, and use the filter function.

    6. With Genome Explorer, you can inspect the actual variants that passed your filtering. Click file 4, circled in red in the figure in step 1, to open it in Genome Explorer. Click the “Zoom in” button repeatedly until the display shows 1 base/px. Then click the “Search” tab, choose the chromosome, and enter the position to check. The view then moves to the variant site.

    Image
    Confirmation of variation site
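
    The filtering described in step 5 can also be scripted instead of done in Excel. The sketch below reads a tab-delimited file with a header row and keeps the rows whose value in one column belongs to a given set. The column names (“CHROM”, “POS”, “IMPACT”) and the example rows are hypothetical; check the header of the downloaded file for the actual names.

```python
import csv
import io
from typing import Dict, Iterable, List, Set


def filter_rows(handle: Iterable[str], column: str, allowed: Set[str]) -> List[Dict[str, str]]:
    """Return the rows of a tab-delimited file whose `column` value is in `allowed`."""
    reader = csv.DictReader(handle, delimiter="\t")
    if column not in (reader.fieldnames or []):
        raise KeyError(f"column {column!r} not found in header")
    return [row for row in reader if row[column] in allowed]


# Hypothetical content standing in for the downloaded tab-delimited file.
example_tsv = (
    "CHROM\tPOS\tREF\tALT\tIMPACT\n"
    "1\t10100\tA\tG\tHIGH\n"
    "1\t10200\tAT\tA\tLOW\n"
    "X\t5000\tC\tCTT\tMODERATE\n"
)
kept = filter_rows(io.StringIO(example_tsv), "IMPACT", {"HIGH", "MODERATE"})
```

    For a real file, replace the `io.StringIO` object with `open(path, newline="")`. The positions of the rows that remain can then be entered in the Genome Explorer “Search” tab as described in step 6.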


Related Information