Programmatic recreation of BII-S-3 with the ISA-API

Abstract

With this practical, we’ll be showing how to use the ISA-API object to recreate programmatically the BII-S-3 dataset published in 2008 by Dr Jack Gilbert and colleagues.

The practical demonstrates a so-called “roundtrip” exercice where:

  • Programmatically generated ISA objects are written (serialized) to disk in the ISA-Tab format.

  • The resulting output is then read back into memory by invoking the ISA-API load function.

  • The objects loaded in memory are serialized back out again and compared to the first output to ensure equivalent and integrity to demonstrate validity of the parsing.

1. Let’s start by loading the ISA-API

from isatools.model import (
    Comment,
    Investigation,
    Study,
    StudyFactor,
    FactorValue,
    OntologyAnnotation,
    Characteristic,
    OntologySource,
    Material,
    Sample,
    Source,
    Protocol,
    ProtocolParameter,
    ProtocolComponent,
    ParameterValue,
    Process,
    Publication,
    Person,
    Assay,
    DataFile,
    plink
)
import datetime
import os

2. Building the ISA objects for representing BII-S-3

investigation = Investigation()
# i_comment = Comment(name="i_comment", value="i_value")
# investigation.comments.append(i_comment)


# Declaring the Ontologies and Vocabularies used in the ISA Study
# dummy_onto=OntologySource(name="Dumbo",description="")
chebi=OntologySource(name="CHEBI",description="Chemical Entity of Biological Interest")
efo=OntologySource(name="EFO", description="Experimental Factor Ontology")
obi = OntologySource(name='OBI', description="Ontology for Biomedical Investigations")
pato = OntologySource(name='PATO', description="Phenotype and Trait Ontology")
ncbitaxon = OntologySource(name="NCIBTaxon", description="NCBI Taxonomy")
investigation.ontology_source_references=[chebi,efo,obi,pato,ncbitaxon]


study = Study(filename="s_BII-S-3-synthesis.txt")
# st_comment = Comment(name="st_comment", value="st_value")
# study.comments.append(st_comment)
study.identifier = "BII-S-3-synth"
study.title = "Metagenomes and Metatranscriptomes of phytoplankton blooms from an ocean acidification mesocosm experiment"
study.description = "Sequencing the metatranscriptome can provide information about the response of organisms to varying environmental conditions. We present a methodology for obtaining random whole-community mRNA from a complex microbial assemblage using Pyrosequencing. The metatranscriptome had, with minimum contamination by ribosomal RNA, significant coverage of abundant transcripts, and included significantly more potentially novel proteins than in the metagenome. This experiment is part of a much larger experiment. We have produced 4 454 metatranscriptomic datasets and 6 454 metagenomic datasets. These were derived from 4 samples."
study.submission_date = "15/08/2008"
study.public_release_date = "15/08/2008"

# These NCBI SRA related ISA Comments fields are required and must be present for the ISA SRAconverter is to be invoked later
src_comment_sra1 = Comment(name="SRA Broker Name", value="OXFORD")
src_comment_sra2 = Comment(name="SRA Center Name", value="OXFORD")
src_comment_sra3 = Comment(name="SRA Center Project Name", value="OXFORD")
src_comment_sra4 = Comment(name="SRA Lab Name", value="Oxford e-Research Centre")
src_comment_sra5 = Comment(name="SRA Submission Action", value="ADD")
study.comments.append(src_comment_sra1)
study.comments.append(src_comment_sra2)
study.comments.append(src_comment_sra3)
study.comments.append(src_comment_sra4)
study.comments.append(src_comment_sra5)

# These ISA Comments are optional and may be used to report funding information
src_comment_st1 = Comment(name="Study Funding Agency", value="")
src_comment_st2 = Comment(name="Study Grant Number", value="")
study.comments.append(src_comment_st1)
study.comments.append(src_comment_st2)
   
# Declaring all the protocols used in the ISA study. Note also the declaration of Protocol Parameters when needed.
study.protocols = [ 
    Protocol(name="environmental material collection - standard procedure 1",
             description="Waters samples were prefiltered through a 1.6 um GF/A glass fibre filter to reduce Eukaryotic contamination. Filtrate was then collected on a 0.2 um Sterivex (millipore) filter which was frozen in liquid nitrogen until nucelic acid extraction. CO2 bubbled through 11000 L mesocosm to simulate ocean acidification predicted conditions. Then phosphate and nitrate were added to induce a phytoplankton bloom.",
             protocol_type=OntologyAnnotation(term="sample collection"),
             parameters=[
              ProtocolParameter(parameter_name=OntologyAnnotation(term="filter pore size"))
             ]
            ),
    Protocol(name="aliquoting-0", description="aliquoting", protocol_type=OntologyAnnotation(term="sample collection")),
    Protocol(
        name="nucleic acid extraction",
        description="Total nucleic acid extraction was done as quickly as possible using the method of Neufeld et al, 2007.",
        protocol_type=OntologyAnnotation(term="nucleic acid extraction")
    ),
    Protocol(
        name="mRNA extraction - standard procedure 3",
        description="RNA MinElute + substrative Hybridization + MEGAclear For transcriptomics, total RNA was separated from the columns using the RNA MinElute clean-up kit (Qiagen) and checked for integrity of rRNA using an Agilent bioanalyser (RNA nano6000 chip). High integrity rRNA is essential for subtractive hybridization. Samples were treated with Turbo DNA-free enzyme (Ambion) to remove contaminating DNA. The rRNA was removed from mRNA by subtractive hybridization (Microbe Express Kit, Ambion), and absence of rRNA and DNA contamination was confirmed using the Agilent bioanalyser. The mRNA was further purified with the MEGAclearTM kit (Ambion). Reverse transcription of mRNA was performed using the SuperScript III enzyme (Invitrogen) with random hexamer primers (Promega). The cDNA was treated with RiboShredderTM RNase Blend (Epicentre) to remove trace RNA contaminants. To improve the yield of cDNA, samples were subjected to random amplification using the GenomiPhi V2 method (GE Healthcare). GenomiPhi technology produces branched DNA molecules that are recalcitrant to the pyrosequencing methodology. Therefore amplified samples were treated with S1 nuclease using the method of Zhang et al.2006.",
        protocol_type=OntologyAnnotation(term="labeling") #nucleic acid extraction
    ),
    Protocol(
        name="genomic DNA extraction - standard procedure 4",
        description="superscript+random hexamer primer",
        protocol_type=OntologyAnnotation(term="nucleic acid extraction")
    ),
    Protocol(
        name="reverse transcription - standard procedure 5",
        description="",
        protocol_type=OntologyAnnotation(term="reverse transcription"),
    ),
     Protocol(
        name="library construction",
        description="",
        protocol_type=OntologyAnnotation(term="library construction"),
        parameters=[
            ProtocolParameter(parameter_name=OntologyAnnotation(term="library strategy")),
            ProtocolParameter(parameter_name=OntologyAnnotation(term="library layout")),
            ProtocolParameter(parameter_name=OntologyAnnotation(term="library selection"))
        ]
    ),   

    Protocol(
        name="nucleic acid sequencing", #pyrosequencing - standard procedure 6",
        description="1. Sample Input and Fragmentation: The Genome Sequencer FLX System supports the sequencing of samples from a wide variety of starting materials including genomic DNA, PCR products, BACs, and cDNA. Samples such as genomic DNA and BACs are fractionated into small, 300- to 800-base pair fragments. For smaller samples, such as small non-coding RNA or PCR amplicons, fragmentation is not required. Instead, short PCR products amplified using Genome Sequencer fusion primers can be used for immobilization onto DNA capture beads as shown below.",
        protocol_type=OntologyAnnotation(term="nucleic acid sequencing"),
        parameters=[
            ProtocolParameter(parameter_name=OntologyAnnotation(term="sequencing instrument"))
        ]
    ),
    Protocol(
        name="sequence analysis - standard procedure 7",
        description="",
        protocol_type=OntologyAnnotation(term="data transformation")
    )
]


# Adding a Study Design descriptor to the ISA Study object
intervention_design = OntologyAnnotation(term_source=obi)
intervention_design.term = "intervention design"
intervention_design.term_accession = "http://purl.obolibrary.org/obo/OBI_0000115"
study.design_descriptors.append(intervention_design)


# Declaring the Study Factors
study.factors = [
    StudyFactor(name="compound",factor_type=OntologyAnnotation(term="chemical substance",
                                                              term_accession="http://purl.obolibrary.org/obo/CHEBI_59999",
                                                              term_source=chebi)),
    StudyFactor(name="dose",factor_type=OntologyAnnotation(term="dose", term_accession="http://www.ebi.ac.uk/efo/EFO_0000428",term_source=efo)),
    StudyFactor(name="collection time",factor_type=OntologyAnnotation(term="time", term_accession="http://purl.obolibrary.org/obo/PATO_0000165", term_source=pato))
]

# Associating the levels to each of the Study Factor.
fv1 = FactorValue(factor_name=study.factors[0], value=OntologyAnnotation(term="carbon dioxide"))
fv2 = FactorValue(factor_name=study.factors[1], value=OntologyAnnotation(term="high"))
fv3 = FactorValue(factor_name=study.factors[1], value=OntologyAnnotation(term="normal"))
fv4 = FactorValue(factor_name=study.factors[2], value="may 13th, 2006")
fv5 = FactorValue(factor_name=study.factors[2], value="may 19th, 2006")


# Adding the publications associated to the study
study.publications = [
    Publication(doi="10.1371/journal.pone.0003042",pubmed_id="18725995",
                title="Detection of large numbers of novel sequences in the metatranscriptomes of complex marine microbial communities.",
                status=OntologyAnnotation(term="indexed in PubMed"),
                author_list="Gilbert JA, Field D, Huang Y, Edwards R, Li W, Gilna P, Joint I."),
    Publication(doi="10.1111/j.1462-2920.2008.01745.x",
                title="Potential for phosphonoacetate utilization by marine bacteria in temperate coastal waters",
               pubmed_id="18783384",
               status=OntologyAnnotation(term="indexed in PubMed"),
               author_list="Gilbert JA, Thomas S, Cooley NA, Kulakova A, Field D, Booth T, McGrath JW, Quinn JP, Joint I.")   
]


# Adding the authors of the study
study.contacts = [
    Person(first_name="Jack", last_name="Gilbert", affiliation="Plymouth Marine Laboratory", email="jagi@pml.ac.uk",
           address="Prospect Place, Plymouth, United Kingdom",
           comments=[Comment(name="Study Person REF", value="")],
            roles=[OntologyAnnotation(term="principal investigator role"),
                   OntologyAnnotation(term="SRA Inform On Status"),
                   OntologyAnnotation(term="SRA Inform On Error")]
    ),
    Person(first_name="Dawn", last_name="Field", affiliation="NERC Centre for Ecology and Hydrology",
           address="CEH Oxford, Oxford, United Kingdom",
           comments=[Comment(name="Study Person REF", value="")],
           roles=[OntologyAnnotation(term="principal investigator role")]
    )
]

Now, creating ISA Source and Sample objects and building what is the BII-S-3 Study table

study.sources = [Source(name="GSM255770"), Source(name="GSM255771"),Source(name="GSM255772"),Source(name="GSM255773")]
study.samples = [Sample(name="GSM255770"), Sample(name="GSM255771"), Sample(name="GSM255772"), Sample(name="GSM255773")]

# Note how the treatment groups are defined as sets of factor values attached to the ISA.Sample object
study.samples[0].factor_values=[fv1,fv2,fv4]
study.samples[1].factor_values=[fv1,fv3,fv4]
study.samples[2].factor_values=[fv1,fv2,fv5]
study.samples[3].factor_values=[fv1,fv3,fv5]

characteristic_organism = Characteristic(category=OntologyAnnotation(term="Organism"),
                                     value=OntologyAnnotation(term="marine metagenome", term_source=ncbitaxon,
                                                              term_accession="http://purl.obolibrary.org/obo/NCBITaxon_408172"))


# Now creating a Process showing a `Protocol Application` using Source as input and producing Sample as output.
for i in range(len(study.sources)):
    
    study.sources[i].characteristics.append(characteristic_organism)


    study.process_sequence.append(Process(executes_protocol=study.protocols[0],
                                         inputs=[study.sources[i]],
                                         outputs=[study.samples[i]])
                                                  )

# Now appending the ISA Study object to the ISA Investigation object    
investigation.studies = [study]

Now, creating the ISA objects to represent assays and acquired raw data

# Starting by declaring the 2 types of assays used in BII-S-3 as coded with ISAcreator tool
assay = Assay(filename="a_gilbert-assay-Gx.txt")
assay.measurement_type = OntologyAnnotation(term="metagenome sequencing",term_accession="http://purl.obolibrary.org/obo/OBI_0002623", term_source=obi)
assay.technology_type = OntologyAnnotation(term="nucleotide sequencing", term_accession="http://purl.obolibrary.org/obo/OBI_0000626", term_source=obi)
# assay.technology_type = OntologyAnnotation(term="")



assay_tx = Assay(filename="a_gilbert-assay-Tx.txt")
assay_tx.measurement_type = OntologyAnnotation(term="transcription profiling",term_accession="http://purl.obolibrary.org/obo/OBI_0000424", term_source=obi)
assay_tx.technology_type = OntologyAnnotation(term="nucleotide sequencing",term_accession="http://purl.obolibrary.org/obo/OBI_0000626", term_source=obi)

warning make sure the plink function connects the first and last protocol in a chain

for i, sample in enumerate(study.samples):
    
#     # create an aliquoting process which creates new samples from existing samples created in a study.processSequence
#     aliquoting_process = Process(executes_protocol=study.protocols[1])
#     aliquot = Sample(name="aliquot-{}".format(i), derives_from=sample)
#     aliquoting_process.inputs.append(sample)
#     aliquoting_process.outputs.append(aliquot)

    # create an extraction process that executes the extraction protocol

    extraction_process = Process(executes_protocol=study.protocols[0])

    # extraction process takes as input a sample, and produces an extract material as output
    
    char_ext = Characteristic(category=OntologyAnnotation(term="Material Type"),
                                     value=OntologyAnnotation(term="pellet"))
    
    char_ext1 = Characteristic(category=OntologyAnnotation(term="quantity"),
                                     value=OntologyAnnotation(term="loads"))

    # extraction_process.inputs.append(aliquot)
    extraction_process.inputs.append(sample)
    material = Material(name="extract-{}".format(i))
    material.type = "Extract Name"
    material.characteristics.append(char_ext)
    material.characteristics.append(char_ext1)
    extraction_process.outputs.append(material)

    # create a sequencing process that executes the sequencing protocol

    sequencing_process = Process(executes_protocol=study.protocols[7])
    sequencing_process.name = "assay-name-{}".format(i)
    sequencing_process.inputs.append(extraction_process.outputs[0])
    # sequencing_process.inputs.append(material)

    # Sequencing process usually has an output data file

    datafile = DataFile(filename="sequenced-data-{}".format(i), label="Raw Data File")
    data_comment = Comment(name="data_comment",value="data_value")
    datafile.comments.append(data_comment)
    sequencing_process.outputs.append(datafile)

    # Ensure Processes are linked forward and backward. plink(from_process, to_process) is a function to set
    # these links for you. It is found in the isatools.model package

    # plink(aliquoting_process, sequencing_process)
    plink(extraction_process, sequencing_process)
    # make sure the extract, data file, and the processes are attached to the assay


    assay.samples.append(sample)
    # assay.samples.append(aliquot)
    assay.other_material.append(material)
    assay.data_files.append(datafile)
    
    assay.process_sequence.append(extraction_process)
    assay.process_sequence.append(sequencing_process)

    # create an extraction process that executes the extraction protocol

    extraction_process_tx = Process(executes_protocol=study.protocols[2])

    # extraction process takes as input a sample, and produces an extract material as output

    extraction_process_tx.inputs.append(sample)
    # extraction_process_tx.outputs.append(sample)
    # extraction_process_tx.outputs=[]
    # material_tx = Material(name="extract-{}".format(i))
    # material_tx.type = "Extract Name"
    # extraction_process_tx.outputs.append(material_tx)

    # labeling process takes as input an extract, and produces a labeled extract material as output
    labeling_process_tx = Process(executes_protocol=study.protocols[3])
    # labeling_process_tx.inputs.append(sample)
    # labeling_process_tx.inputs=[]
    material_tx = Material(name="le-{}".format(i))
    material_tx.type = "Labeled Extract Name"
    labeling_process_tx.outputs.append(material_tx)
    
    
    # create a sequencing process that executes the sequencing protocol
    # seq_pv1 = ParameterValue(category="library layout", value="SINGLE")
    seq_pv2 = ParameterValue(category=ProtocolParameter(parameter_name=OntologyAnnotation(term="library layout")), value=OntologyAnnotation(term="SINGLE"))
    sequencing_process_tx = Process(executes_protocol=study.protocols[6], parameter_values=[seq_pv2])
    sequencing_process_tx.name = "assay-name-tx-{}".format(i)
    sequencing_process_tx.inputs.append(labeling_process_tx.outputs[0])

    # Sequencing process usually has an output data file
    
    datafile_tx = DataFile(filename="sequenced-data-tx-{}".format(i), label="Raw Data File")
    sequencing_process_tx.outputs.append(datafile_tx)

    # Ensure Processes are linked forward and backward. plink(from_process, to_process) is a function to set
    # these links for you. It is found in the isatools.model package

   
    # plink(extraction_process_tx, labeling_process_tx)
    # plink(labeling_process_tx, sequencing_process_tx)
    plink(extraction_process_tx, sequencing_process_tx)
    # make sure the extract, data file, and the processes are attached to the assay


    assay_tx.samples.append(sample)
    assay_tx.other_material.append(material_tx)
    assay_tx.data_files.append(datafile_tx)
    
    assay_tx.process_sequence.append(extraction_process_tx)
    assay_tx.process_sequence.append(labeling_process_tx) 
    assay_tx.process_sequence.append(sequencing_process_tx)    

Adding both assays to the ISA Study object

study.assays.append(assay)
study.assays.append(assay_tx)
# from isatools import isatab
# print(isatab.dumps(investigation))
# assay_g = investigation.studies[0].assays[1]
# assay_t = investigation.studies[0].assays[0]

# assay_t.samples=investigation.studies[0].samples


# extract1 = Material(name="GSM255770.e1")
# extract1.type = "Extract Name"
# extract2 = Material(name="GSM255771.e1")
# extract2.type = "Extract Name"
# extract3 = Material(name="GSM255772.e1")
# extract3.type = "Extract Name"
# extract4 = Material(name="GSM255773.e1")
# extract4.type = "Extract Name"
# extract5 = Material(name="GSM255774.e1")
# extract5.type = "Extract Name"

# assay_t.other_material.append(extract1)
# assay_t.other_material.append(extract2)
# assay_t.other_material.append(extract3)
# assay_t.other_material.append(extract4)
# assay_t.other_material.append(extract5)


# for i in range(len(study.samples)):

#     assay_t.process_sequence.append(Process(
#             executes_protocol=study.protocols[1],
#                 inputs=study.samples[i],
#                 outputs=assay_t.other_material[i]
#         ))
    
#     data=DataFile(filename="sequenced-data-{}".format(i), label="Raw Data File")  
    
#     assay_t.process_sequence.append(Process(
#         executes_protocol=study.protocols[3],
#                 inputs=assay_t.other_material[i]
#     ))
    
#     plink(assay_t.process_sequence[0], assay_t.process_sequence[1])
#     assay_t.process_sequence[-1].outputs.append(data)
#     assay_t.data_files.append(data)

Writing the ISA object representation to file with the ISA-API dump function

from isatools.isatab import dump

# note the use of the flag for explicit serialization on factor values on assay tables
dump(investigation, "./output/BII-S-3-synth/", write_factor_values_in_assay_table=True)

Reading the ISA document from disk back in, loading it into memory and writing to disk again to check that the ISA-API load function works nominally

from isatools.isatab import load
with open(os.path.join("./output/BII-S-3-synth/","i_investigation.txt")) as bii3stest:
    roundtrip = load(bii3stest)
# note the use of the flag for explicit serialization on factor values on assay tables
dump(roundtrip, "./notebook-output/BII-S-3-roundtrip/", write_factor_values_in_assay_table=True)

Comparing the output of each serialization and checking for equivalence

# import hashlib

# hashlib.md5(open(os.path.join("./output/","i_investigation.txt")).read()).hexdigest()
import filecmp

filecmp.cmp('./output/BII-S-3-synth/i_investigation.txt', './notebook-output/BII-S-3-roundtrip/i_investigation.txt')
filecmp.cmp('./output/BII-S-3-synth/s_BII-S-3-synthesis.txt', './notebook-output/BII-S-3-roundtrip/s_BII-S-3-synthesis.txt')
filecmp.cmp('./output/BII-S-3-synth/a_gilbert-assay-Gx.txt', './notebook-output/BII-S-3-roundtrip/a_gilbert-assay-Gx.txt')
filecmp.cmp('./output/BII-S-3-synth/a_gilbert-assay-Tx.txt', './notebook-output/BII-S-3-roundtrip/a_gilbert-assay-Tx.txt')

Conclusion:

While the i_investigation.txt files are identical, the other comparisons return False indicating that they are not exactly identical.

The cause of the difference is simply down to the order in which FactorValues are serialized. This is caused by the underlying data structure used to store this information but the the information is equivalent.

About this notebook