TopHat2, CuffLinks2 and CummeRbund + GE (SE) for SAGE
This pipeline uses TopHat-Cufflinks, one of the standard tool sets in this field. It is a good starting point when you analyze your data for the first time. Besides detecting de novo splicing variants, the pipeline adds GO annotations and reports the most frequent entries.
This pipeline requires reference genome sequences and gene annotation information. First, reads are mapped with TopHat. CuffLinks then quantifies the expression of each gene and each transcript, detecting new splicing variants along the way. Expression levels are tested statistically with CuffDiff, and on the basis of that result CummeRbund generates several graphs, including a clustering of expression patterns. Finally, GO terms that occur significantly more frequently within each cluster are extracted with GOseq and visualized with REVIGO.
|Species||Human, Rat, Mouse, Drosophila|
|Execution time||A few days|
Raw NGS reads (FASTQ format, single-end): 2-30
・Check the dataset type explanation: fastq (single-end)
|1．Analysis report (Genome Explorer)||Analysis report (CummeRbund Density Plot)|
|Analysis report (CummeRbund Dendrogram)||Analysis report (CummeRbund Scatter Plot)|
|Analysis report (CummeRbund Volcano Plot)||Analysis report (CummeRbund Cluster Plot)|
|Analysis report (GOseq report)||Analysis report (REVIGO)|
|Analysis report (expression table)||2．Information accompanying TopHat-CuffLinks (splice junction)|
|Information accompanying TopHat-CuffLinks (Insertion)||Information accompanying TopHat-CuffLinks (Deletion)|
How to run this pipeline
1．First, log in to Maser and open the Project page by clicking its tab. To create a new project, click “Create New Project.”
2．Give the project a name (for example, “human SAGE-seq”) and click “O.K.”
3．Click the created project to open its page.
4．Once the project exists, you can upload input files (here, FASTQ files). Click “Upload My Data.”
5．Choose a file transfer protocol. In most cases, click “HTTPS.”
6．Enter a data label (here, “ERR030872.fastq”), then choose the data type (here, “fastq (single-end)”) and the file to upload (here, C:\ERR030872.fastq).
The name of the file to be uploaded is subject to the following restrictions.
・Do not use spaces “ ”, parentheses “( )”, quotation marks “ ‘ ”, or Japanese characters. Use only alphabetic characters, digits, and underscores “_” in file names.
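The file name restrictions above can be checked before uploading. The following is a minimal sketch, not part of Maser itself; the function names are hypothetical, and it assumes that a dot is also acceptable so that an extension such as “.fastq” can be kept, as in the example file name.

```python
import re

# Assumption: besides ASCII letters, digits and underscores, a dot is allowed
# so that an extension such as ".fastq" survives (the example name uses one).
SAFE_NAME = re.compile(r"[A-Za-z0-9_]+(\.[A-Za-z0-9_]+)*")


def is_safe_upload_name(file_name):
    """Return True if file_name contains only ASCII letters, digits, "_" and "."."""
    return SAFE_NAME.fullmatch(file_name) is not None


def suggest_upload_name(file_name):
    """Replace every character outside the allowed set with an underscore."""
    return re.sub(r"[^A-Za-z0-9_.]", "_", file_name)
```

For example, `is_safe_upload_name("ERR030872.fastq")` is true, while a name containing spaces or parentheses is rejected and can be cleaned up with `suggest_upload_name` before renaming the file.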
FASTQ files also come in several quality score encodings. If you do not know which encoding your files use, first run FastQC, a quality check tool. If the “Basic Statistics” section of the FastQC output shows “Sanger / Illumina 1.8,” you can go on to the next step. If it shows “Illumina 1.3 - 1.7,” you must specify “--phred64-quals” as a TopHat option, as described below.
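If FastQC is not at hand, the encoding can be roughly guessed from the quality characters themselves. The sketch below is only a heuristic under stated assumptions: it assumes plain four-line FASTQ records, the function name is hypothetical, and a file containing only high-quality reads can be misjudged, so the FastQC result should take precedence.

```python
def guess_quality_encoding(fastq_lines, max_reads=10000):
    """Guess the quality encoding from the lowest quality character seen.

    fastq_lines is any iterable of lines (for example an open file).
    Returns "phred33", "phred64", "ambiguous" or "unknown".
    """
    lowest = None
    for line_no, line in enumerate(fastq_lines):
        if line_no // 4 >= max_reads:
            break
        if line_no % 4 == 3:  # fourth line of each record holds the qualities
            quality = line.rstrip("\r\n")
            if quality:
                low = min(ord(ch) for ch in quality)
                lowest = low if lowest is None else min(lowest, low)
    if lowest is None:
        return "unknown"
    if lowest < 59:   # characters below ';' only occur with an offset of 33
        return "phred33"
    if lowest >= 64:  # consistent with an offset of 64 (Illumina 1.3 - 1.7)
        return "phred64"
    return "ambiguous"
```

A result of `"phred64"` suggests that “--phred64-quals” is needed for TopHat; `"phred33"` corresponds to “Sanger / Illumina 1.8.”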
If you have more samples to compare, click “Add file” and repeat the operation above for each additional FASTQ file, then click “Upload.”
7．Here, three data sets have been uploaded.
Click the “Select” button below each of the three.
The order in which you click “Select” determines the sample order in the output, so click in the order in which you want the samples to appear.
8．The list of selected files is displayed in a separate window. Click the “Analysis” button.
9．Choose “SAGE-seq,” then select a pipeline from the list on the right. If several pipelines have the same name, select the latest version.
10．Scroll down to see the input file list. Click “Set option and run.”
11．The options you can change are explained below.
First, check the reference name to be used (e.g., hg38 for human, mm10 for mouse). For the annotation type, “refGene” or “ensGene” is available, but “refGene” is the safer choice in most cases.
Enter the sample names from the top (i.e., starting with the input1 sample name) in the order in which you selected the files. Give technical replicates the same sample name. A sample name must start with a letter and consist only of letters, digits, and underscores. If possible, keep the sample name to 10 characters or fewer, because characters after the 10th are sometimes not displayed in output files.
After confirming the other options, click the “Run” button at the bottom of the page.
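The sample name rules in step 11 can be checked in advance. This is a minimal sketch with hypothetical function names; it assumes that “alphabet” means ASCII letters, and it shows how entering the same name groups technical replicates by input position.

```python
import re

# Assumption: "alphabet" means ASCII letters A-Z and a-z.
SAMPLE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")
RECOMMENDED_MAX_LENGTH = 10


def check_sample_name(name):
    """Return (valid, too_long) for one sample name."""
    valid = SAMPLE_NAME.fullmatch(name) is not None
    too_long = len(name) > RECOMMENDED_MAX_LENGTH
    return valid, too_long


def group_replicates(sample_names):
    """Map each sample name to the input positions (1-based) that share it."""
    groups = {}
    for position, name in enumerate(sample_names, start=1):
        groups.setdefault(name, []).append(position)
    return groups
```

For instance, `group_replicates(["brain", "brain", "liver"])` places input1 and input2 in the same group, which is how technical replicates are meant to be declared.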
1．To check the results or the progress status, open the project page and click “Module Flow View.” You can see which modules have completed and which are still running.
2．In the Module Flow view, the final result report is displayed as the rightmost icon (enclosed by a red line). Click the icon to display the report.
3．At the top of the report, the TopHat mapping result is visualized with Genome Explorer. To see it in a separate window, click “In a new window.” For how to use Genome Explorer, see this page or this page.
4．Scroll down to see a histogram and a box plot of the gene expression levels of each sample.
5．Scroll further to see the clustering of samples by similarity of gene expression pattern. You can check whether technical replicates are closest to each other.
6．Scroll further to see scatter plots and volcano plots of gene expression levels for all pairwise combinations of samples.
7．Scroll further to see the k-means clustering of expression patterns for genes that differ significantly between any pair of groups.
8．Links to the significant GO terms of each cluster, extracted with GOseq, are displayed.
For instance, to display the characteristic GO list of cluster 3, click “cluster: (3).” The list consists of significantly over-represented and under-represented GO terms, with gene length bias taken into account by GOseq. As background, GOseq uses the GO terms of the organism chosen in step 5. Therefore, if the GOseq database has no data for the organism and genes in question, GOseq produces no results. So far, we have confirmed the results with RefGene annotation for human, mouse, and fruit fly. If you want to use another organism or annotation, please e-mail us and we will consider it.
9．To make the characteristic GO terms of each cluster easier to understand, the GOseq result can be passed to REVIGO. Click the “p-value < 0.001 Revigo” button from step 8.
The REVIGO website opens with the cluster 3 GO list, consisting of GO terms with a p-value below 0.001, already entered. Click “Start Revigo” at the bottom.
In some cases no over-represented GO terms are found because the cluster contains only a small number of contigs. If so, go back one step and click the “full Revigo” button, which enters all GO terms of the cluster into REVIGO.
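The choice between the two buttons amounts to a p-value filter with a fallback. The sketch below only illustrates that logic; the function name and the (GO ID, p-value) pair layout are assumptions, and the actual format Maser passes to REVIGO is not described here.

```python
def select_go_terms(go_pvalues, threshold=0.001):
    """Keep GO terms with p-value below threshold; fall back to the full list.

    go_pvalues is a list of (go_id, p_value) pairs (hypothetical layout).
    """
    significant = [(go_id, p) for go_id, p in go_pvalues if p < threshold]
    # Mirrors the "full Revigo" button: use every term when none passes.
    return significant if significant else list(go_pvalues)
```

With a small cluster where nothing passes the threshold, the function returns all terms, which corresponds to clicking “full Revigo.”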
The figure below shows the REVIGO TreeMap.
10．The upper half of the report shows gene-level expression results, and the lower half shows transcript-level results. If you are interested in new splicing variants, check the transcript-level results. Note, however, that transcript-level results tend to contain more false positives, so if certainty is your priority, check the gene-level results.
11．Links to the gene-level and transcript-level expression tables are shown at the middle and the bottom of the report page, respectively.
Click a link to download the tab-delimited text file, and open it with Excel or another application.
The leftmost column header of the file is “tracking id,” which is the ID of the gene or transcript predicted by CuffLinks. When a gene or transcript is located near an already-known gene, the name of that gene is shown in the “gene short name” column. The chromosomal position and the length of the gene (or transcript) are also given, so if you need to examine a gene with a distinctive expression pattern, you can look up its chromosomal position in Genome Explorer.
Further to the right, the expression level (FPKM) of each sample calculated by CuffDiff is given. FPKM (fragments per kilobase of transcript per million mapped reads) is a normalized count of mapped reads that in principle allows comparison between different samples and between different genes (in practice, FPKM alone is often hard to rely on because of various biases).
- In SAGE-seq, one transcript corresponds to only one tag because of how the library is prepared: the transcript is cut with a restriction enzyme and only the tag at the 3’ end is sequenced. In RNA-seq, by contrast, multiple different reads are derived from one transcript because any fragment of the transcript can be sequenced.
- Thus, this pipeline applies no correction for gene length.
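The difference between a length-corrected and an uncorrected value can be seen in a small calculation. The functions below follow the textbook definition of FPKM and a plain per-million scaling; they are an illustration only, with hypothetical names, and do not reproduce the statistical model that CuffDiff actually uses.

```python
def fpkm(fragments, length_bp, total_mapped):
    """Fragments per kilobase of transcript per million mapped reads."""
    return fragments * 1e9 / (length_bp * total_mapped)


def tags_per_million(tags, total_mapped):
    """Per-million scaling without any gene length correction."""
    return tags * 1e6 / total_mapped


# Example: 500 reads on a 2,000 bp gene in a library of 10 million mapped reads.
print(fpkm(500, 2000, 10_000_000))          # 25.0
print(tags_per_million(500, 10_000_000))    # 50.0
```

Because each transcript yields a single tag in SAGE-seq, dividing by length would penalize long genes without justification, which is why the uncorrected form is the appropriate one here.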
To the right of the expression levels, the cluster numbers determined by CummeRbund are shown. These numbers correspond to those in the figures of steps 7 and 8. Further to the right, you can see the CuffDiff results of the group comparison for all pairwise combinations.
The rightmost three columns list the RefSeq gene ID, the Entrez gene ID, and the GO terms linked to the gene ID.
12．By filtering on the numbers in the “cluster” column (for example, with the Excel filter function), you can obtain a list of genes that show an expression pattern of interest.
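The same filtering can be done outside Excel. The sketch below reads the tab-delimited expression table and keeps the rows of one cluster; the function name is hypothetical, and the exact header spellings (“cluster”, “tracking id”) are assumptions that should be checked against the first line of the downloaded file.

```python
import csv


def genes_in_cluster(lines, cluster, cluster_column="cluster"):
    """Return the table rows whose cluster column equals the given number.

    lines is any iterable of tab-delimited lines, header first
    (for example a file opened with newline="").
    """
    reader = csv.DictReader(lines, delimiter="\t")
    wanted = str(cluster)
    return [row for row in reader
            if (row.get(cluster_column) or "").strip() == wanted]
```

Typical use would be `with open(path, newline="") as handle: rows = genes_in_cluster(handle, 3)`, after which each row is a dictionary keyed by the column headers of the table.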