Overview
The SOAPdenovo2 case study is a reproducibility study that explores how existing research objects and workflow enactment engines can help assess, record and preserve scientific workflows and their associated findings. It does so by reviewing a comparison of sequence assembly algorithm performance made in the light of the development of the SOAPdenovo2 de novo genome assembler. The case study was a joint effort by the GigaScience journal, the Investigation/Study/Assay (ISA) infrastructure, Nanopublication (Nanopub) and Research Object (RO) communities, and the developers of the SOAPdenovo2 de novo genome assembler.
The following presentation, delivered at the Intelligent Systems for Molecular Biology (ISMB) 2014 workshop "What bioinformaticians need to know about digital publishing beyond the PDF2" in Boston, USA, showcases the SOAPdenovo2 study and explores how principles derived from experimental design practice, data and computational models can greatly enhance data quality, data generation, data reporting, data publication and data review.
The following publication describes the SOAPdenovo2 case study and our recommendations for improving scholarly publishing using research object models.
An earlier version was available as a pre-print.
This work was also presented at the Bioinformatics Open Source Conference (BOSC) 2015, and these are the slides:
Galaxy Workflows
The Galaxy workflows and corresponding histories for reproducing part of the SOAPdenovo2 study, i.e. Table 2 from the original paper, can be found at the GigaGalaxy server, and these are the specific links, classified by organism and genome assembler:
Organism / Assembler | SOAPdenovo2 | SOAPdenovo1 | AllPathsLG |
---|---|---|---|
S. aureus | Workflow | Workflow | Workflow |
 | History | History | History |
R. sphaeroides | Workflow | Workflow | Workflow |
 | History | History | History |
Data Models: ISA, RO and Nanopublication
The ISA model, with its focus on experimental design, insists on the declaration of study plans (e.g., experimental factors considered) and provides cues for reviewers to assess content and suitability of the plans. Furthermore, the underlying model ensures that inputs and outputs of processes, or workflows, are declared and identified, referring to existing database identifiers when relevant. Initially intended to draw the graph of sample processing through to coarse data processing, the ISA grammar is generic enough to cover computational processing while allowing referencing to more granular forms such as Galaxy files.
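To illustrate what declaring experimental factors buys a reviewer, the sketch below enumerates study groups as the combinations of factor levels. It assumes a full factorial design with two factors, organism and assembler, whose levels are taken from the workflow table on this page; the factor names and the `study_groups` helper are illustrative and are not part of the ISA tooling.

```python
from itertools import product

# Assumed factors and levels, read off the organism/assembler table on this
# page. The ISA-Tab archive remains the authoritative declaration.
factors = {
    "organism": ["S. aureus", "R. sphaeroides"],
    "assembler": ["SOAPdenovo2", "SOAPdenovo1", "AllPathsLG"],
}


def study_groups(factors):
    """Return one dict per combination of factor levels (full factorial)."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


groups = study_groups(factors)
for group in groups:
    print(group)
print(len(groups))  # 2 organisms x 3 assemblers = 6 groups
```

With the factors declared explicitly, questions such as "how many study groups are there?" become a matter of counting combinations rather than reading the methods section.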
ISA and RO provide means to track experimental and computational workflows, respectively. There is some acknowledged overlap between them, which is handled by deferring to domain-specific resources: the Research Object project recommends ISA for the biological domain. Finally, since describing how data are acquired, generated and analyzed is only part of the story, the description of the findings also requires attention. The Nanopublication model tackles what used to be the blind side of data reporting: capturing experimental conclusions.
Investigation-Study-Assay
- Scope:
- Experimental Design, Variable, Material Processing, Data Processing workflows.
- Outcomes:
- a tab-delimited archive presenting an overview of the SOAPdenovo2 experiment following the ISA-TAB specification: it includes a description of the experimental design (e.g. independent and response variables), the genomes and data used in SOAPdenovo2 together with stable identifiers, and a description of the experimental and computational workflows for evaluating SOAPdenovo2 against its predecessor SOAPdenovo1 and against ALLPATHSLG, including their inputs and outputs, provenance and attribution information
- an explicit OWL/RDF semantic representation generated using the ISA2OWL software component and relying on mappings between the ISA syntax and ontological resources such as the Ontology for Biomedical Investigations (OBI) and the Provenance ontology (PROV-O).
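Because the ISA-TAB archive is tab-delimited, its tables can be inspected with standard tooling. The following is a minimal sketch using only the Python standard library. The column headers follow ISA-Tab conventions, but the rows are invented for illustration and are not copied from the SOAPdenovo2 archive; `read_isatab_table` is a hypothetical helper name.

```python
import csv
import io

# Hypothetical excerpt in the style of an ISA-Tab study table.
# The values below are illustrative only.
STUDY_TABLE = (
    "Source Name\tCharacteristics[organism]\tProtocol REF\t"
    "Sample Name\tFactor Value[assembler]\n"
    "genome-1\tStaphylococcus aureus\tassembly\tassembly-1\tSOAPdenovo2\n"
    "genome-1\tStaphylococcus aureus\tassembly\tassembly-2\tSOAPdenovo1\n"
)


def read_isatab_table(text):
    """Parse a tab-delimited table into a list of {header: value} dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


rows = read_isatab_table(STUDY_TABLE)
for row in rows:
    # Each row links an input (source) to an output (sample) of a protocol.
    print(row["Source Name"], "->", row["Sample Name"],
          "| assembler:", row["Factor Value[assembler]"])
```

Real ISA-Tab tables may repeat a header such as `Protocol REF` several times in one file; a dict-based reader keeps only the last of those columns, so a position-aware parser is the safer choice for complete files.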
Nanopublication
- Scope:
- Key findings, supporting evidence
- Outcome:
- NanoMaton template: an OntoMaton Google spreadsheet Nanopublication Template, where OntoMaton is a Google Spreadsheets widget relying on services for ontology lookup and annotation from NCBO Bioportal and Linked Open Vocabularies.
- a completed NanoMaton template holding all nanopublications as structured tables
- the NanoMaton code to convert from the Google Spreadsheet template to RDF serializations of the nanopublications, relying on the NanoPub-Java library
- RDF serializations according to the Nanopub guidelines
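For readers unfamiliar with the shape of a nanopublication, the sketch below assembles a TriG document with the four named graphs the model prescribes: head, assertion, provenance and publication info. It is plain string formatting, not the NanoMaton or NanoPub-Java code. The `nanopub_trig` function and all `example.org` IRIs are hypothetical placeholders; only the `np:` and `prov:` vocabulary terms are real.

```python
NP = "http://www.nanopub.org/nschema#"
PROV = "http://www.w3.org/ns/prov#"


def nanopub_trig(base, assertion, derived_from, creator):
    """Return a TriG string holding the four named graphs of a nanopublication.

    `assertion` is a (subject, predicate, object) tuple of full IRIs.
    """
    s, p, o = assertion
    return "\n".join([
        f"@prefix np: <{NP}> .",
        f"@prefix prov: <{PROV}> .",
        f"@prefix : <{base}> .",
        "",
        ":Head {",
        "  : a np:Nanopublication ;",
        "    np:hasAssertion :assertion ;",
        "    np:hasProvenance :provenance ;",
        "    np:hasPublicationInfo :pubinfo .",
        "}",
        ":assertion {",
        f"  <{s}> <{p}> <{o}> .",
        "}",
        ":provenance {",
        f"  :assertion prov:wasDerivedFrom <{derived_from}> .",
        "}",
        ":pubinfo {",
        f"  : prov:wasAttributedTo <{creator}> .",
        "}",
    ])


# Placeholder IRIs only; no real finding from the study is encoded here.
doc = nanopub_trig(
    base="http://example.org/np1#",
    assertion=(
        "http://example.org/assembler",
        "http://example.org/hasProperty",
        "http://example.org/value",
    ),
    derived_from="http://example.org/evidence",
    creator="http://example.org/author",
)
print(doc)
```

Keeping the assertion in its own graph is what lets the provenance graph make statements about it, for example which evidence it was derived from.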
Research Object
- Scope:
- Scientific workflow artifacts
- Outcome:
Queries
We provide a set of queries demonstrating how the data models can be used to inspect the information about the SOAPdenovo2 study and its results. The following table summarises the queries and the model(s) used to answer them. The queries themselves and links to execute them can be found through the table and below.
Query Case | RO | ISA | Nanopub |
---|---|---|---|
Who was involved in the study? | | See and execute query | |
What are the inputs and outputs for all the data transformations in the study? | (for inputs) See query | See and execute query | |
What are the Galaxy workflows related to the SOAPdenovo2 case study? | | See and execute query | |
What was the study design? | | See and execute query | |
What are the study factors (or independent variables) and their levels (or values they assumed)? | | See and execute query | |
How many study groups are there? | | See and execute query | |
Which are the study groups? | | See and execute query | |
Which are the members of the study groups? | | See and execute query | |
What are the sizes of the study groups? | | See and execute query | |
What was the funding agency of the study? | | See and execute query | |
What is the licence for the metadata? | | See and execute query | |
What is the PubMed identifier for the associated publication(s) for the study? | | See and execute query | |
Find all the nanopublications related to the study | See query | See and execute query | See and execute query |
Find the authors for each assertion in the nanopublications | | | See and execute query |
Find the nanopublications and their associated authors | | | See and execute query |
Next, we list the different queries and provide links to the results of executing them against a SPARQL endpoint.
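For readers who prefer to run such queries programmatically, the sketch below builds a SPARQL Protocol GET request URL that restricts a query to the SOAPdenovo2 graph through the standard `default-graph-uri` parameter. The endpoint address is a hypothetical placeholder, the query is a generic triple listing rather than one of the study's queries, and `sparql_get_url` is an illustrative helper name. No request is sent.

```python
from urllib.parse import urlencode

GRAPH = "http://w3id.org/isa/soapdenovo2"   # graph IRI given on this page
ENDPOINT = "http://example.org/sparql"      # hypothetical endpoint address

# Generic placeholder query, not one of the study's .sparql files.
QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""


def sparql_get_url(endpoint, query, graph):
    """Build a SPARQL Protocol GET URL that queries a single graph."""
    params = urlencode({"query": query.strip(), "default-graph-uri": graph})
    return f"{endpoint}?{params}"


url = sparql_get_url(ENDPOINT, QUERY, GRAPH)
print(url)
```

An equivalent approach is to leave the request parameters alone and wrap the query pattern in a `GRAPH <http://w3id.org/isa/soapdenovo2> { ... }` clause.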
Queries over the ISA research model
Who were the people involved in creating the ISA-Tab representation and what were their roles?
Execute this SPARQL query over the SOAPdenovo2 named graph (http://w3id.org/isa/soapdenovo2).
What are the inputs and outputs for all the data transformations in the study?
data_transf_inputs_outputs.sparql
Execute this SPARQL query over the SOAPdenovo2 named graph (http://w3id.org/isa/soapdenovo2).
What are the Galaxy workflows related to the SOAPdenovo2 case study?
Execute this SPARQL query over the SOAPdenovo2 named graph (http://w3id.org/isa/soapdenovo2).
What was the study design?
Execute this SPARQL query over the SOAPdenovo2 named graph (http://w3id.org/isa/soapdenovo2).
What are the study factors (or independent variables) and their levels?
Execute this SPARQL query over the SOAPdenovo2 named graph (http://w3id.org/isa/soapdenovo2).
How many study groups are there?
Execute this SPARQL query over the SOAPdenovo2 named graph (http://w3id.org/isa/soapdenovo2).
Which are the study groups?
Execute this SPARQL query over the SOAPdenovo2 named graph (http://w3id.org/isa/soapdenovo2).
Which are the members of the study groups?
Execute this SPARQL query over the SOAPdenovo2 named graph (http://w3id.org/isa/soapdenovo2).
What are the sizes of the study groups?
Execute this SPARQL query over the SOAPdenovo2 named graph (http://w3id.org/isa/soapdenovo2).
What was the funding agency of the study?
Execute this SPARQL query over the SOAPdenovo2 named graph (http://w3id.org/isa/soapdenovo2).
What is the licence for the metadata?
Execute this SPARQL query over the SOAPdenovo2 named graph (http://w3id.org/isa/soapdenovo2).
What is the PubMed identifier for the associated publication(s) for the study?
study_publication_pubmedid.sparql
Execute this SPARQL query over the SOAPdenovo2 named graph (http://w3id.org/isa/soapdenovo2).
What are the nanopublications generated for the study?
Execute this SPARQL query over the SOAPdenovo2 named graph (http://w3id.org/isa/soapdenovo2).
Queries over the Nanopublication research model
Find all the nanopublications related to the study
Find the authors for each assertion in the nanopublications
Execute this SPARQL query over the SOAPdenovo2 named graph (http://w3id.org/isa/soapdenovo2).
Find the nanopublications and their associated authors
Execute this SPARQL query over the SOAPdenovo2 named graph (http://w3id.org/isa/soapdenovo2).
Queries over the Research Object
hybrid_inputs_isa_provenance.sparql
hybrid_workflow_isa_study.sparql
inputs_derived_gage_output.sparql
workflow_generated_gage_results.sparql
Contributors
- Alejandra Gonzalez-Beltran (@agbeltran), Oxford e-Research Centre, University of Oxford, UK
- Peter Li (@pli888), GigaScience, BGI HK Research Institute, Hong Kong.
- Jun Zhao, InfoLab21, Lancaster University
- Mark Thompson, Department of Human Genetics, Leiden University Medical Center, The Netherlands
- Maria Susana Avila-Garcia, Nuffield Department of Medicine, Experimental Medicine Division, John Radcliffe Hospital, Oxford, UK.
- Ruibang Luo, HKU-BGI Bioinformatics Algorithms and Core Technology Research Laboratory & Department of Computer Science, University of Hong Kong, Pokfulam, Hong Kong.
- Tak-Wah Lam, HKU-BGI Bioinformatics Algorithms and Core Technology Research Laboratory & Department of Computer Science, University of Hong Kong, Pokfulam, Hong Kong.
- Tin-Lap Lee, School of Biomedical Sciences and CUHK-BGI Innovation Institute of Trans-omics, The Chinese University of Hong Kong, Shatin, Hong Kong.
- Marco Roos, Department of Human Genetics, Leiden University Medical Center, The Netherlands
- Scott Edmunds, GigaScience, BGI HK Research Institute, Hong Kong.
- Susanna-Assunta Sansone, Oxford e-Research Centre, University of Oxford
- Philippe Rocca-Serra, Oxford e-Research Centre, University of Oxford
Support or Contact
For discussions about the SOAPdenovo2 case study, please contact Alejandra, Peter and Philippe.
The issue tracker is available at: SOAPdenovo2 case study GitHub site. Please feel free to report issues or request features.