Platform for Drug Discovery


Concept of Maser


Advantages of introducing a GUI analysis system

Wall of CUI

  • One of the most remarkable differences between bioinformaticians and non-bioinformaticians is their preference for the CUI (Character User Interface), such as the Linux command line and scripting languages like shell scripts, Perl, R, Python, and Ruby.
  • The CUI lends itself to task repeatability, task automation, and batch execution, but its learning cost is not low.
  • We believe it is unrealistic to expect every researcher and technical staff member who must handle huge data, such as data from next-generation sequencers, to pay the cost of learning the CUI.

Bridging GUI and CUI

  • Our pipeline system aims to serve as a bridge between bioinformaticians and non-bioinformaticians.
  • More concretely, the system generates command line scripts for analyses from GUI operations and runs them in the appropriate order.
  • Conversely, a GUI pipeline can easily be generated from a command line script.
  • Most users operate the system without being aware of the scripts running in the background.
  • Since most users do not need to overcome the learning cost of the CUI, non-bioinformatician users can analyze their own data on their own initiative, and can try the option changes or tool replacements described below without hesitation.
  • Bioinformaticians, on the other hand, are relieved of most routine analysis work, so they can concentrate on developing new methods, analysis flows, tools, and algorithms.
  • In other words, this system is designed around the following assumptions, and pipelines are produced and maintained under the same concept.
    1. Users of this system include people who are not bioinformaticians: they do not have to be familiar with the CUI, and most of them are not aware of subtle differences between tools and data formats.
    2. Those who register and maintain pipelines in this system are so-called bioinformaticians, who must be familiar with the CUI and able to write scripts.

About auto-generated command line scripts

  • The generated scripts are useful in the following situations.
    1. When a user encounters an error and consults bioinformatics experts who are not familiar with this GUI analysis environment.
    2. When a user wants to repeat or customize the analysis flow in his or her local environment.
    3. When a user wants to check whether the expected commands were executed.

Position of this system


About data- or tool-dependent problems

  • Unfortunately, academic tools sometimes cause data-dependent problems, or problems that depend on particular versions or options.
  • Because our pipelines are composed of well-known academic tools, they cannot entirely avoid these problems either.
  • We do not check every possible combination of options or versions.
  • Our pipelines may still contain bugs, even though we have checked that they complete successfully on common data and respond as expected to the test patterns we prepared.

Scope of our support

  • Our team will support, to the extent possible, successful execution of the pipelines with their default options.
  • However, we will not support the following situations.
    1. The input data is inappropriate because it is too large, too small, or not in a common format.
    2. Unexpected option values were used.
  • If the problem is not specific to our system, you must solve it yourself.
  • You may sometimes be able to solve the problem with the help of bioinformatics expert communities such as SeqAnswers and BioStars.
  • Our team also maintains a site, NGS Surfer's Wiki, to share problems and their solutions in Japanese.
  • The auto-generated command lines described above are useful for sharing what you did with bioinformatics experts.
  • You can also use error messages from the tools as clues for solving the problem.
  • Details on how to debug are described here.

Standpoint

  • Image
  • If you are not familiar with the CUI and have enough budget for a commercial analysis service or commercial analysis package licenses, we recommend considering those options to avoid problems that stem from academic tools.
  • Even while you are considering them, our analysis overviews may help you determine whether the service or package under consideration meets your requirements.
  • There are other similar open GUI analysis environments, such as Galaxy and Taverna.
  • If none of the following conditions apply to you, we ask you to use another analysis option, because our limited computing resources should preferentially be used by non-bioinformaticians.
    • You do not have computing resources available.
    • You are not familiar with the CUI.
    • You want to use an original function of our system.
    • You want to learn academic tool pipelines through this system.
    • You will teach this system to non-bioinformatician users.
    • You will register your own workflow in our system.
    • You are a project member of the Cell Innovation Program.
  • If you want to deploy this system on your own computing resources, please contact us. We welcome such trials.

Overcoming the barrier


  • We hope that our system, or other GUI analysis environments, will eliminate the need for the CUI for many non-bioinformaticians.
  • However, not a few researchers will inevitably have to go beyond this barrier, because much research requires advanced methods that are not yet established.
  • Our team therefore makes several efforts to support researchers and technical staff in getting over this wall.
  • Image

First Step

  • We provide fully automated pipelines on Maser that enable you to run an analysis workflow on your own data with almost no prior knowledge.
  • As the result of these pipelines, you get a candidate list and several reports about the data analysis.
  • We also provide guidance for pipeline selection here.
  • Note that these pipelines do not necessarily meet all of your requirements, and you may have to tune them.
  • See an example here.

Second Step

  • You can tune pipelines through the GUI interface on Maser.
  • You can change tool versions, mapping conditions, organisms, thresholds such as p-value or q-value, and other settings via select boxes or free text areas when you run a pipeline.
  • Since option changes frequently cause errors, we recommend using options that meet one of the following conditions.
    1. Options someone has already used
    2. Options you have tested in your local environment or with the unit pipelines described below
  • See an example of setting options here.

Third Step

  • You can replace tools and add steps on Maser without knowing the detailed differences between data formats and tools.
  • Details are described below.

Fourth Step

  • This step still has high hurdles.
  • We produce auto-generated command lines, which can be used in any local environment if the required tools and data are available.
  • We describe how to obtain the required data on the pipeline description pages.
  • Preparing the tools in your local environment is one of the hardest tasks, so we provide an example of the installation process in our machine environment on this site, which may help you judge how easy or difficult the installation is.
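  • As a rough illustration, building source-distributed tools such as BWA and SAMtools in a local directory typically looks like the sketch below; the version numbers and paths are placeholders, and the source archives are assumed to have been downloaded from each project's site beforehand.

      # hypothetical local build of BWA and SAMtools (versions and paths are placeholders)
      tar xjf bwa-0.x.x.tar.bz2          # unpack the downloaded BWA source archive
      (cd bwa-0.x.x && make)             # compile; produces the 'bwa' executable in that directory
      tar xjf samtools-0.x.x.tar.bz2     # unpack the downloaded SAMtools source archive
      (cd samtools-0.x.x && make)        # compile; produces the 'samtools' executable
      export PATH=$PWD/bwa-0.x.x:$PWD/samtools-0.x.x:$PATH   # make both tools callable from the generated scripts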

Our pipeline features


General analysis workflow systems

  • The figure below shows two common scenarios for analysis systems.
  • Image
  • Pipeline system A provides automated analysis workflows, but you cannot customize them because their internals are not open.
    • Commercial packages tend to be highly reliable, but their customizability is limited because they are not open.
  • Pipeline system B has high flexibility.
    • In exchange, it requires a deep understanding of the differences between tools and of data formats.
    • Sometimes you have to check the behavior of the constituent tools in your local environment beforehand in order to customize anything.
    • Such a situation is too demanding for non-bioinformaticians.

Two types of pipelines

  • Our team provides two types of pipelines.
  • Image
  • The first type is the full-course pipeline.
    • This type provides one example of an overall analysis workflow, covering everything from raw sequences to the final outputs.
    • All you can change is the input data and several predetermined parameters.
  • The other type is the unit pipeline.
    • This type simply provides one function to users.
    • Sometimes the main function together with its pre-processes and post-processes constitutes one unit pipeline.
    • Since some full-course pipelines consist of these unit pipelines, you can easily reconstruct and customize the overall analysis.

Customizability

  • Image
  • You can easily swap in other tools for a unit pipeline.
  • If a tool you need is not on our pipeline list, you can perform that step in your local environment.

Unification of data format types

  • Image
  • We unify the data format types for inputs and outputs as much as possible so that unit pipelines connect smoothly.
  • Sometimes we add a pre- or post-process to unify the data format type, which is not necessarily optimal in terms of execution time and computational efficiency.
  • Note that some pipelines do not obey this unification rule because of maintenance delays.

Feature description of Maser for analysis experts


Grouping commands and grouping data files

  • In our pipeline system, several commands are grouped into one pipeline module.
  • The grouping policies are as follows.
    1. We group processes into a unit whose analytical meaning we can explain to a non-bioinformatician.
    2. We group processes into a unit that generates data used for other analysis purposes or data stored for the medium or long term.
  • We also group several files into one unit dataset serving as process input and output.
  • The grouping policy is almost the same as for processes.
    1. We group data files into a unit whose meaning we can explain to a non-bioinformatician.
    2. We group data files into a unit that is used for other purposes or stored for the long term.
  • Image
  • The figure above shows examples of grouping processes and grouping data.

About sequence data

    • In the figure above, 'Illumina data' and 'SOLiD data' are both sequence dataset types.
    • For a non-bioinformatician, what matters is that the data is a kind of raw sequence data.
    • The following details are trivial matters, so we use only fastq as the unified sequence format.
      • The sequence file for read 1 and the sequence file for read 2 are separate files.
      • Sequence information and the corresponding quality values are integrated into one fastq file.
      • A sequence file for orphan reads, which have no mate, may or may not be present.
    • Note that we distinguish 'fastq (single-end)' for single-end fastq from 'fastq (paired-end)' for paired-end fastq, because most tools handle them differently.
    • You can also use the 'fasta+qual' format, which is distinct from the fastq format (both are sketched after this list).
      • In the 'fasta+qual' format, the sequence fasta file and its corresponding quality file are separate.
      • You have to convert 'fasta+qual' into fastq immediately, because most of our pipelines do not handle this format.
    • We also distinguish standard base-space sequence files from color-space sequence files.
      • Because many popular tools accept csfasta plus a separate quality file as input, we treat the csfasta+qual format as the unified sequence format for color-space sequences.
      • If a tool requires the csfastq format, in which color-space sequence data and quality values are integrated, you have to convert from csfasta+qual to csfastq in its pre-process.
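    • For concreteness, a fastq record integrates the sequence and its quality values in one file, four lines per read, whereas 'fasta+qual' keeps the same information in two parallel files; the read name and values below are illustrative.

        # fastq: one record = four lines, sequence and quality integrated
        @read_001
        ACGTACGTACGTACGT
        +
        IIIIHHHHGGGGFFFF

        # fasta+qual: the same read split between a .fasta file ...
        >read_001
        ACGTACGTACGTACGT
        # ... and a parallel .qual file holding numeric quality values
        >read_001
        40 40 40 40 39 39 39 39 38 38 38 38 37 37 37 37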

About mapping result data

    • The de facto standard mapping result data formats are SAM and BAM, the compressed version of SAM.
    • Non-bioinformaticians have little interest in the following facts.
      • A BAM file is considerably smaller than the corresponding SAM file.
      • You can view the contents of a BAM file with samtools without decompressing it.
      • SAM and BAM can be converted to each other completely reversibly.
      • There are two types of BAM: unsorted (usually name-sorted) BAM and sorted (position-sorted) BAM.
      • Name-sorted BAM and position-sorted BAM can also be converted to each other reversibly.
      • An index file is sometimes added to a sorted BAM for quick access to the mapping data.
      • Sorted BAM with an index file is used by many tools, including IGV and samtools.
    • We treat sorted BAM with an index file as the unified mapping result format (see the sketch after this list).
    • Even if some tools require name-sorted BAM as input, we enforce position-sorted BAM as the common input format.
    • With a few exceptions, we do not use other mapping formats as inputs or outputs.
    • We consider the increased overhead of unification an unavoidable sacrifice for simple connection of processes.
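    • For reference, producing this unified format with standard samtools commands looks roughly like the following; the file names are illustrative, and newer samtools versions use 'samtools sort -o mapping.sorted.bam mapping.bam' instead of the older syntax shown here.

        samtools view -bS mapping.sam > mapping.bam    # SAM -> BAM: lossless and much smaller on disk
        samtools sort mapping.bam mapping.sorted       # sort by reference position (older samtools syntax)
        samtools index mapping.sorted.bam              # creates mapping.sorted.bam.bai for quick random access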

About mapping processes

    • For a non-bioinformatician, what matters is that the process maps reads to a reference genome.
    • The following details are trivial matters.
      • The first major version of the BWA program performs mapping in two steps.
      • The output format of BWA is SAM, not BAM.
      • SAM is usually converted to BAM with samtools or Picard to save disk space.
    • Since the bowtie tool requires orphan-removed sequences as input, you have to remove orphans from typical SOLiD output sequences before mapping.
    • We therefore hide such detailed pre-processes and post-processes from users through process grouping.

Use of data and data storage strategy

About storing data

    • We think the following should be stored for the medium or long term.
      1. Original raw data
      2. Data that can be used for further analyses or other analyses
      3. Data for viewing the results
    • As you know, data from next-generation sequencers are huge, so deciding what to save and what to discard is very important for efficient disk usage.

About intermediates

    • At the same time, we think intermediate data are important for troubleshooting and pipeline debugging.
    • However, the demand for these intermediates disappears after a while.
    • The timing of deleting these intermediates is therefore important, but we think users should not have to bother managing them.

About automatic deletion of intermediates

    • In our system, these intermediates are automatically deleted a few days after the pipeline ends successfully.
    • If the pipeline aborts abnormally, the intermediate data are important for debugging and for identifying the cause of the problem, so we exclude them from automatic deletion.
    • If these intermediates were removed immediately after the pipeline run, you might have to rerun several command steps for debugging, even when those steps are very time-consuming.
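    • The exact mechanism is internal to Maser, but conceptually the cleanup can be pictured as a periodic job like the hypothetical sketch below; the directory layout, marker file, and retention period are illustrative assumptions, not our actual implementation.

        # hypothetical periodic cleanup; Maser's real mechanism may differ
        # assumes each successfully finished run leaves a SUCCESS marker in its working directory
        RETENTION_DAYS=3                                    # illustrative retention period
        for dir in /data/pipeline_runs/*/; do
            if [ -e "$dir/SUCCESS" ]; then                  # only clean runs that finished successfully
                find "$dir/intermediates" -type f -mtime +"$RETENTION_DAYS" -delete
            fi                                              # aborted runs keep their intermediates for debugging
        done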

Merits of automatic deletion

    • Process grouping also works efficiently for this purpose.
    • All of our pipelines are designed with this in mind, so users do not need to distinguish intermediates from other data.
    • Administrators also do not have to worry about this, because the deletion is done automatically.
    • On our servers, hundreds of gigabytes of data are automatically removed every week, which means this function regularly saves that much disk space.
    • We hope you can see that users and pipeline developers stand at different layers, and that our system manages each layer appropriately.

Actual view

    • Image
    • You can see that the process module 'BWA (PE)' for paired-end fastq consists of the following three inner steps.
      1. bwa aln (paired-end)
      2. bwa sampe (paired-end)
      3. samToBam/sort/indexing SAMtools with head
    • In the step 'bwa aln (paired-end)', the 'bwa aln' command is executed twice, once for each sequence file.
    • In the step 'bwa sampe (paired-end)', the output files from bwa aln and the original sequence files are used as inputs.
    • In the step 'samToBam/sort/indexing SAMtools with head', the 'samtools view', 'samtools sort', and 'samtools index' commands are combined and executed.
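    • Put together, the commands behind this module are roughly the following sketch; the reference and file names are placeholders, and the exact options used on Maser are not shown.

        # illustrative expansion of the 'BWA (PE)' module (file names and options are placeholders)
        bwa aln ref.fa read_1.fastq > read_1.sai           # step 1: align read 1
        bwa aln ref.fa read_2.fastq > read_2.sai           # step 1: align read 2
        bwa sampe ref.fa read_1.sai read_2.sai read_1.fastq read_2.fastq > out.sam   # step 2: pair the reads and emit SAM
        samtools view -bS out.sam > out.bam                # step 3: SAM -> BAM
        samtools sort out.bam out.sorted                   # step 3: position-sort (older samtools syntax)
        samtools index out.sorted.bam                      # step 3: index for quick access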