Platform for Drug Discovery


TopHat2, CuffLinks2 and CummeRbund + GE (SE) for SAGE


Introduction


    This pipeline uses TopHat-Cufflinks, one of the standard tool sets in this field, and is a good first choice for data being analyzed for the first time. In addition to detecting novel (de novo) splicing variants, it adds GO annotations and reports the GO entries that occur most frequently.


    This pipeline requires reference genome sequences and gene annotation information. First, reads are mapped with TopHat. CuffLinks then quantifies the expression of each gene and each transcript while detecting new splicing variants. Expression differences are tested statistically with CuffDiff, and from that result CummeRbund generates several graphs, including a clustering of expression patterns. Finally, GOseq extracts the GO terms that occur significantly more frequently within each cluster, and the result is visualized with REVIGO.


    Input format: FASTQ
    Library layout: Single-end
    Species: Human, Rat, Mouse, Drosophila
    Execution time: A few days



Inputs


    Raw NGS reads (FASTQ format, Single-end): 2-30

    ・Check the dataset type explanation: fastq (single-end)

Outputs


    Workflow
    Image
    1. Analysis report
    ・Genome Explorer: visualization of the TopHat mapping result using Genome Explorer
    ・CummeRbund Density Plot: distribution of expression levels for each sample
    ・CummeRbund Dendrogram: similarity between samples
    ・CummeRbund Scatter Plot: scatter plots for each pairwise combination of samples
    ・CummeRbund Volcano Plot: volcano plots for each pairwise combination of samples
    ・CummeRbund Cluster Plot: clustering result
    ・GOseq report: HTML report produced after GO annotation
    ・REVIGO: visualization of GO terms (GOseq → REVIGO)
    ・Expression table: list of expression levels (TSV file, viewable with Excel)

    2. Information accompanying TopHat-CuffLinks
    ・Splice junction: list of splice junction positions reported by TopHat (BED format)
    ・Insertion: list of insertion positions reported by TopHat (BED format)
    ・Deletion: list of deletion positions reported by TopHat (BED format)

How to run this pipeline


    1. First, log in to Maser and open the Project page by clicking its tab. Click “Create New Project” to create a new project.
    Image


    2. Give the project a name (for example, “human SAGE-seq”) and click “O.K.”
    Image


    3. Click the created project to open its page.
    Image


    4. Once the project exists, you can upload the input files (here, FASTQ files). Click “Upload My Data.”
    Image


    5. Choose a file transfer protocol. In most cases, click “HTTPS.”
    Image


    6. Enter a data label (here, “ERR030872.fastq”), then choose the data type (here, “fastq (single-end)”) and the file to upload (here, C:\ERR030872.fastq).
    Image


    The name of an uploaded file is subject to the following restrictions.
    Do not use spaces “ ”, parentheses “( )”, quotation marks “ ‘ ”, or Japanese characters. Use only alphabetic characters, digits, and underscores “_” in the file name.
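
The restriction above can be checked before uploading. The following is a minimal sketch, not part of Maser itself: the function name is hypothetical, and allowing dots for the extension is an assumption based on the example name “ERR030872.fastq”.

```python
import re

# Assumption: ASCII letters, digits, and underscores are allowed, plus dots
# that separate extensions (the example name "ERR030872.fastq" contains one).
SAFE_NAME = re.compile(r"[A-Za-z0-9_]+(\.[A-Za-z0-9_]+)*")


def is_safe_upload_name(file_name: str) -> bool:
    """Return True if the name has no spaces, parentheses, quotes, or non-ASCII text."""
    return SAFE_NAME.fullmatch(file_name) is not None


if __name__ == "__main__":
    for name in ["ERR030872.fastq", "my sample (1).fastq", "サンプル.fastq"]:
        print(name, is_safe_upload_name(name))
```

Only the first of the three example names passes the check; the other two contain spaces and parentheses, or Japanese characters.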

    FASTQ files also come in several quality encodings. If you do not know which encoding your files use, first run FastQC, a quality check tool. If “Sanger / Illumina 1.8” is shown in the “Basic Statistics” section of the FastQC output, you can go on to the next step. If “Illumina 1.3 - 1.7” is shown, you must specify “--phred64-quals” as a TopHat option, as described below.
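
If FastQC is not at hand, the encoding can be roughly guessed from the lowest quality character in the file. The sketch below is a heuristic only, with hypothetical function names; it assumes the common four-lines-per-read FASTQ layout. A file that contains only high-quality Phred+33 reads could be misclassified, so the FastQC output remains the reference.

```python
from typing import Iterable, Iterator


def read_quality_lines(path: str, max_reads: int = 10000) -> Iterator[str]:
    """Yield the quality line (4th line of each record) of the first max_reads reads.

    Assumes 4 lines per read (no wrapped sequence or quality lines).
    """
    with open(path) as handle:
        for index, line in enumerate(handle):
            if index >= 4 * max_reads:
                break
            if index % 4 == 3:
                yield line.rstrip("\n")


def guess_quality_encoding(quality_lines: Iterable[str]) -> str:
    """Guess the quality offset from the lowest quality character seen."""
    lowest = min(
        (ord(ch) for line in quality_lines for ch in line.strip()),
        default=None,
    )
    if lowest is None:
        return "unknown"    # no quality characters were read
    if lowest < 59:
        return "phred33"    # characters below ';' cannot occur with a +64 offset
    if lowest >= 64:
        return "phred64"    # consistent with Illumina 1.3 - 1.7
    return "ambiguous"      # 59-63: old Solexa-scaled data or too few reads


if __name__ == "__main__":
    print(guess_quality_encoding(["IIII#!5"]))   # contains '!' (ASCII 33)
    print(guess_quality_encoding(["hhhhdddB"]))  # lowest character is 'B' (ASCII 66)
```

For a real file, the call would look like `guess_quality_encoding(read_quality_lines("ERR030872.fastq"))`. A result of "phred64" corresponds to the case in which “--phred64-quals” is needed.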

    If you have more samples to compare, upload additional FASTQ files by clicking “Add file” and repeating the operation described above. When all files are listed, click “Upload.”


    7. Here, three files have been uploaded.
    Click the “Select” button below each of the three.
    The order in which you click “Select” determines the sample order in the output, so click in the order in which you want the samples to appear.
    Image


    8. The list of selected files is displayed in a separate window. Click the “Analysis” button.
    Image


    9. Choose “SAGE-seq,” then select a pipeline from the list on the right. If several pipelines have the same name, select the latest version.
    Image


    10. Scroll down to see the input file list, then click “Set option and run.”

    Image


    11. The options you can change are explained below.
    First, check the name of the reference to be used (e.g., hg38 for human, mm10 for mouse). For the annotation type, “refGene” or “ensGene” is available; “refGene” is the safer choice in most cases.
    Enter the sample names from the top (i.e., starting with the input1 sample name) in the order in which you selected the files. Give technical replicates the same sample name. A sample name must start with a letter and consist only of letters, digits, and underscores. If possible, keep sample names to 10 characters or fewer, because characters after the 10th are not displayed in some output files.
    After checking the other options, click the “Run” button at the bottom of the page.
    Image
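
The sample name rules in step 11 can be checked before submitting the form. This is a minimal sketch with a hypothetical helper name and made-up sample names; it only restates the rules given above and does not reflect how Maser validates input.

```python
import re
from collections import Counter

# Rule from step 11: start with a letter, then letters, digits, or underscores.
NAME_RULE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def check_sample_names(names):
    """Return a list of problems; an empty list means every name passes."""
    problems = []
    for position, name in enumerate(names, start=1):
        if not NAME_RULE.fullmatch(name):
            problems.append(
                f"input{position}: '{name}' must start with a letter and "
                "contain only letters, digits, and underscores"
            )
        elif len(name) > 10:
            problems.append(
                f"input{position}: '{name}' exceeds 10 characters and may be "
                "truncated in output files"
            )
    return problems


if __name__ == "__main__":
    names = ["ctrl", "ctrl", "treated_1"]  # illustrative names, in selection order
    print(check_sample_names(names))
    # Names that appear more than once are the ones treated as technical replicates.
    print({name: count for name, count in Counter(names).items() if count > 1})
```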


Result explanation


    1. To check the results or the progress status, open the project page and click “Module Flow View.” It shows which modules have completed and which are still running.
    Image


    2. In the Module Flow View, the final result report is the rightmost icon (outlined in red). Click the icon to display the report.
    Image


    3. After you click the icon outlined in red, the TopHat mapping result is visualized with Genome Explorer at the top of the report. To view it in a separate window, click “In a new window.” For how to use Genome Explorer, see this page or this page.
    Image


    4. Scroll down to see a histogram and a box plot of the gene expression levels for each sample.
    Image
    Image


    5. Scroll down further to see the result of clustering the samples by the similarity of their gene expression patterns. Use it to check whether technical replicates are closest to each other.
    Image


    6. Scroll down further to see scatter plots and volcano plots of gene expression levels for all pairwise combinations of samples.
    Image
    Scatter plot of all pairwise combinations between samples

    Image
    Volcano Plot of all pairwise combinations between samples
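
For orientation, a volcano plot conventionally places each gene at x = log2 fold change and y = -log10(p-value), so genes with large, significant changes appear in the upper corners. The sketch below computes these coordinates for one gene. The function name and the pseudocount are assumptions of this example and are not stated to be what CummeRbund uses.

```python
import math


def volcano_point(level_a, level_b, p_value, pseudocount=1.0):
    """Return (log2 fold change of b over a, -log10 p-value) for one gene.

    The pseudocount (an assumption of this sketch) avoids division by zero
    when a gene is not expressed in one of the samples.
    """
    x = math.log2((level_b + pseudocount) / (level_a + pseudocount))
    y = -math.log10(max(p_value, 1e-300))  # guard against p = 0
    return x, y


if __name__ == "__main__":
    print(volcano_point(1.0, 3.0, 0.01))
```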



    7. Scroll down further to see the k-means clustering of the expression patterns of genes that differ significantly between any pair of groups.
    Image


    8. Links to the significant GO terms for each cluster, extracted with GOseq, are displayed.
    Image
    For instance, to display the characteristic GO list for cluster 3, click “cluster: (3).” The list contains the significantly over-represented and under-represented GO terms, computed by GOseq with gene length bias taken into account. As the background, GOseq uses the GO annotation of the organism chosen when the options were set (step 11 of “How to run this pipeline”). Consequently, GOseq produces no results if its database has no data for the organism and genes in question. So far, we have confirmed results with the RefGene annotation for human, mouse, and fruit fly. If you want to use another organism or annotation, please e-mail us and we will consider it.
    Image
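
To give a feel for what “over-represented” means here, the following sketch computes a plain hypergeometric upper-tail probability for one GO term in one cluster. It is a simplification: GOseq additionally weights genes to correct for length bias, which this example omits. The function name and all counts are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn from `population` genes,
    of which `annotated` carry the GO term. No length-bias correction."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Illustrative counts: 10 genes in total, 3 annotated with the term,
    # a cluster of 3 genes, all 3 of them annotated.
    print(hypergeom_upper_tail(population=10, annotated=3, sample=3, hits=3))
```

A small probability means that the term appears in the cluster more often than random sampling from the background would explain.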


    9. To make the characteristic GO terms of each cluster easier to interpret, the GOseq result can be passed to REVIGO. Click the “p-value < 0.001 Revigo” button shown in step 8.

    Image
    The REVIGO website opens with the cluster 3 GO list, restricted to GO terms with a p-value below 0.001, already entered. Click “Start Revigo” at the bottom.
    In some cases no over-represented GO terms are found because the cluster contains only a small number of genes. If so, go back one step and click the “full Revigo” button, which enters all GO terms of the cluster into REVIGO.
    The figure below shows the REVIGO TreeMap.
    Image
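
The two buttons described in step 9 amount to a filter with a fallback. The sketch below prepares the same kind of list by hand, one GO ID and its p-value per line, which is a form REVIGO accepts as pasted input. The function name and the p-values are made up for illustration, and this is not the code behind the buttons.

```python
def revigo_input(go_results, threshold=0.001):
    """Build REVIGO input text from (GO ID, p-value) pairs.

    Keeps terms with p < threshold; if none pass, falls back to the full
    list, mirroring the "full Revigo" button.
    """
    selected = [(go_id, p) for go_id, p in go_results if p < threshold]
    if not selected:
        selected = list(go_results)
    return "\n".join(f"{go_id}\t{p:g}" for go_id, p in selected)


if __name__ == "__main__":
    results = [("GO:0006915", 0.0002), ("GO:0007049", 0.03)]  # illustrative values
    print(revigo_input(results))
```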

    10. The upper half of the report shows the expression results at the gene level, and the lower half shows them at the transcript level. If you are interested in new splicing variants, check the transcript-level results. Note, however, that transcript-level results tend to contain more false positives, so if reliability is your priority, check the gene-level results.


    11. Links to the gene-level and transcript-level expression tables are shown in the middle and at the bottom of the report page, respectively.

    Image
    Click a link to download the tab-delimited text file, and open it with Excel or another application.
    The leftmost column header is “tracking id,” which holds the ID of the gene or transcript predicted by CuffLinks. When a gene or transcript is located near a known gene, that gene's name is shown in the “gene short name” column. The chromosomal position and length of the gene (or transcript) are also listed, so if a gene shows a distinctive expression pattern, you can look up its chromosomal position in Genome Explorer.
    Image

Further to the right, the expression level (FPKM) of each sample, as computed by CuffDiff, is listed. FPKM (fragments per kilobase per million mapped reads) is a normalized count of mapped reads that in principle allows comparison between different samples and between different genes (in practice, FPKM alone is rarely sufficient because of various biases).

  • For SAGE-seq, one transcript corresponds to only one tag because of how the library is prepared: the transcript is cut with a restriction enzyme and only the tag at the 3’ end is sequenced. For RNA-seq, in contrast, one transcript yields many different reads because any fragment of the transcript can be sequenced.
  • Thus, this pipeline applies no gene length correction.
Image
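
The difference between the two normalizations can be made concrete with the textbook formulas. The sketch below is illustrative only: the function names and numbers are hypothetical, and how CuffDiff is configured inside this pipeline is not shown in this manual.

```python
def fpkm(fragments, length_bp, total_mapped):
    """Fragments per kilobase per million mapped reads (length-corrected)."""
    return fragments * 1e9 / (length_bp * total_mapped)


def per_million(tags, total_mapped):
    """Tags per million mapped reads: depth-corrected only, no length term,
    matching the SAGE-seq reasoning above (one tag per transcript)."""
    return tags * 1e6 / total_mapped


if __name__ == "__main__":
    # Illustrative numbers: 500 reads on a 2,000 bp gene, 10 million mapped reads.
    print(fpkm(500, 2000, 10_000_000))     # 25.0
    print(per_million(500, 10_000_000))    # 50.0
```

With length correction, a gene twice as long with the same read count gets half the value; without it, the value depends only on the read count and the sequencing depth.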


To the right of the expression levels, the cluster numbers assigned by CummeRbund are shown. These numbers correspond to those in the figures of steps 7 and 8. Further to the right are the CuffDiff results of the group comparison for all pairwise combinations.
Image


The rightmost three columns list the RefSeq gene ID, the Entrez gene ID, and the GO terms linked to the gene ID.
Image


12. By filtering on the numbers in the “cluster” column (using Excel's filter function), you can obtain a list of genes that show an expression pattern of interest.
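
The same filtering can be done without Excel. The sketch below reads a tab-delimited table and returns the gene names in one cluster. The exact column headers are assumptions based on the names quoted above (“gene short name,” “cluster”), and the rows are made-up examples; adjust the headers to match the downloaded file.

```python
import csv
import io


def genes_in_cluster(tsv_text, cluster,
                     cluster_column="cluster",
                     name_column="gene short name"):
    """Return the gene names of all rows whose cluster column equals `cluster`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row[name_column] for row in reader
            if row[cluster_column] == str(cluster)]


if __name__ == "__main__":
    # Made-up rows; the header names are assumptions, not the real file layout.
    example = (
        "tracking id\tgene short name\tcluster\n"
        "XLOC_000001\tGAPDH\t3\n"
        "XLOC_000002\tACTB\t1\n"
        "XLOC_000003\tTP53\t3\n"
    )
    print(genes_in_cluster(example, 3))  # ['GAPDH', 'TP53']
```

For a downloaded file, pass its contents with `open(path).read()`, where `path` is wherever the table was saved.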

Related information