Table of Contents
- Considering community-based standards
- An evolving landscape with a complex life cycle
- Formulating: use cases, scope and prioritization
- Conducting: iterations, feedback and requests
- Maintaining: implementations, education, sustainability and evolution
- Realizing the potential
Through a number of activities and events, the National Institutes of Health (NIH) Big Data to Knowledge initiative (BD2K) will make the biomedical research enterprise more data-centric, with a vision for findable, accessible, citable, interoperable and reusable datasets that are linked with other research products to fuel new investigations and discoveries.
Here we report on the importance of data and metadata-related standards in this ecosystem, and the enabling role of community-based standards efforts. Specifically, we analyze issues that pertain to their entire life cycle, from formulation to adoption and maintenance, and introduce initial opportunities for BD2K to promote and encourage these crucial endeavors.
Biomedical research generates large amounts of complex and diverse data – big data. While the vast majority of these data remain in the labs that produced them, some areas of biomedical research, such as genomics, have a tradition of making data broadly available. The process is often guided by established policies, such as that for NIH Genome Wide Association Studies. In these areas, the availability of data – and seizing opportunities that availability affords (e.g., 1) – has had a major impact on the rate, quantity and quality of scientific progress and its impact on society. In recognition of this, individuals and organizations around the world are rallying to further enhance this paradigm (e.g., 2, 3).
Today, high returns on research investment, like those seen in genomics, are increasingly expected by funding organizations and those who benefit from research. But, the major public products of most of today’s biomedical research enterprise are still limited to concepts, such as hypotheses, interpretations and conclusions, described in scientific papers (e.g., 4). The data underlying these concepts are rarely available, so today’s enterprise is concept-centric.
New technical capabilities and scientific opportunities in biomedical research are making data sharing both easier and more fruitful. Changing perspectives and expectations of society have encouraged science policy initiatives to increase access to data, reproducibility of research results, and return on investment in research. These converging trends will likely translate into a dramatic increase in the broad availability of biomedical research data. But, even when data are available, they are often not reusable or reproducible by independent investigators due to incomplete annotation of sample descriptions and data processing, or the use of ad hoc or proprietary formats and terminologies. Available data need to be rendered broadly usable – able to work with other data, tools and data resources – through the use of data-related standards. In order to make best use of data-related standards, investigators need to know what standards are available for use. And, in those cases where attention is required to catalyze a standards-related effort with potentially significant impact, support should be available. Such awareness and support would represent important steps in making broadly available data broadly usable and, in turn, making tomorrow’s biomedical research enterprise more data-centric.
Considering community-based standards
The BD2K initiative promises to play a major role in transforming the nature of the biomedical enterprise through three major thrusts. First, it will enable scientific innovation by advancing the science and technology of biomedical big data through the support of centers of excellence and research project grants, as well as by expediting the use of large-scale computing in biomedical research. Second, it will enhance and develop the workforce in biomedical big data through a variety of innovative training initiatives aimed at levels from students to senior investigators. Third, BD2K will facilitate the broad use of biomedical data by working to change NIH policies, practices and culture as they relate to data management, sharing and use. This will include careful consideration of community-based standards efforts.
Therefore, gaining a clearer understanding of such dynamic community efforts, and of how NIH might best relate to them, is pivotal and was the scope of this BD2K workshop on Frameworks for Community-based Standards Efforts. Note that the focus of the workshop was limited to standards to represent and share data/results files and the associated contextual metadata (e.g., minimum reporting requirements, common metadata, terminologies – controlled vocabulary, taxonomy, thesaurus, ontology – file formats and conceptual models). Clearly many more types of community-based standards efforts exist (see Text Box 1), and there is no standard way to classify and organize them.
An evolving landscape with a complex life cycle
Mapping the landscape of community-based data and metadata standards is not a simple task. Most efforts are long-term endeavors with many different players and types of efforts. Broadly, these standards initiatives work to allow data sets to be harmonized with regard to structure, formatting and annotation so as to open their content to transparent interpretation, an understanding of how and why the conclusions of a scientific paper were drawn from the data collected. They also facilitate reuse of data, as well as the integrative analysis and comparison to other data sets. Finally, such efforts also interlink data with other research products, such as scholarly material, algorithms and software.
Community-based standards endeavors are well-served by diverse perspectives, as such diversity exposes the spectrum of real-world needs to address. When standards efforts are not only developed for a research community, but also by that research community – with the engagement of a wide range of diverse stakeholders – the likelihood that the standards will be successfully implemented is greatly increased (e.g., 5, 6, 7, 8).
An inevitable consequence of having multiple community-based efforts is that they are fragmented, sometimes duplicating efforts of others and sometimes leaving gaps in domains that should be covered. For example, currently, descriptions of experiments in which a sample has been subject to several kinds of assays, using a variety of technologies, are particularly challenging to share as coherent units of research because of the variety of standards for reporting data and experimental metadata with which the parts must be formally represented. Regardless of the context of a particular standards effort, it is important that standards are interoperable to allow research objects to work with one another, in order to provide consistency across data, tools, codes and resources.
For this reason, awareness of which groups are doing what is vital to a coordinated approach; and, since the activities of any group change over time, such awareness must be continually updated. For data and metadata, BioSharing provides an at-a-glance view of almost 600 community-developed content standards in the broad life sciences, linking them to database resources to monitor their use and working to assign criteria for adoption (9). Defining the maturity of a standard, for example, is critical to its adoption, so that it can be channeled to the appropriate stakeholder community for use. The ownership of open standards is problematic in broad, grass-roots collaborations, but new business models, including encouraging the involvement of commercial entities, would mitigate these problems. And, the unfunded nature of many standards efforts requires methods and processes to assign formal rewards and provide incentives for all contributors that operate in a voluntary manner.
The challenges, however, extend well beyond these points. Analysis of the community-based standards life cycle has revealed that different issues pertain to each phase (i.e., formulating, conducting and maintaining standards efforts); communities’ social and technical approaches to common problems are also quite diverse. An understanding of such a complex ecosystem of players and types of efforts, and their complex operational life cycle, is pivotal if we are to realize the potential that community-based standards can offer to the biomedical research enterprise of the future.
Formulating: use cases, scope and prioritization
Community-based standards efforts typically start with the identification of a need for standards, often in recognition that not meeting that need has significant opportunity costs. Use cases, sometimes presented in the form of stakeholder stories, are valuable for defining the breadth and depth of the requirements – making clear what purpose the standards effort is meant to address. For example, a public health research project might use the standards that are required by a government agency for similar data (such as those required by the USA Centers for Disease Control in public health reporting) to allow for interoperability with and comparison to a much larger data set.
The need for a particular standards effort, such as a common terminology for human diseases, is typically driven by the needs of the community of practice being made known through any of a number of different channels. For example, direct observation of problems in the research community is employed by the Human Proteome Organization’s Proteomics Standards Initiative (HUPO-PSI), the Bioinformatics Resource Centers (BRCs), and PhenX. Organized discussions with polling of the community are the approaches taken by the discussion forum of the Metabolomics Society and the Genomic Standards Consortium (GSC). And, the use of a standing body to which requests for various standards efforts are made is exemplified by the Data Documentation Initiative (DDI) Alliance, operating in the social sciences, which has its Technical Committee field requests for new standards as well as modifications to existing standards.
Use cases are also valuable for delimiting the scope and identifying any related standards space that is (or isn’t) part of the effort at hand. To prioritize areas of focus, the Neuroscience Information Framework (NIF) uses indicators of impact such as the likelihood of a wide adoption, possible enhancement or leverage of other standards efforts, and the increase in efficiency or effectiveness for the community. The ISA Commons has prioritized its efforts on different experimental types, focusing on capturing the experimental context and then creating a reference system to link to other data-specific representation formats.
In all cases, however, when a need becomes apparent, it is best to first determine whether there are possible solutions that already exist, including those that could be adopted in whole or in part, modified or extended. For example, the ISA-Tab-Nano (10) by the Nanotechnology Working Group (Nano WG) of the National Cancer Informatics Program is a direct extension of the ISA-Tab, one that is now a formal standard by the American Society for Testing and Materials (ASTM). The COordination Of Standards In MetabOlomicS (COSMOS) also couples the reuse of the ISA formats (in its tabular and semantic web versions) with the repurposing of relevant formats by the HUPO-PSI. The additional social engineering work needed to facilitate the engagement between two or more efforts, and to ensure all parties will get the relevant credit, is a time-consuming factor but ultimately contributes to the wider adoption and interoperability among standards.
This formulating phase crucially depends upon identifying, assembling and engaging with the right people, with iterations punctuated by consultation with experts and the broader community. Some standards efforts, for example, the Computational Modeling in Biology Network (COMBINE), which coordinates standards in systems biology, allow anyone to self-identify and volunteer to help, while others, such as the NCI Nano WG, actively solicit experts for particular roles.
Conducting: iterations, feedback and requests
Once the determination to move forward with a particular standards effort has been made, a core group of individuals with the necessary expertise and perspectives convenes numerous times to get the effort underway. Although some groups have successfully started the discussion with virtual interactions, face-to-face meetings seem crucial to galvanize a new group, enable participants to evaluate their commitment, design the initial workplan, harmonize different perspectives and explore available options such as whether existing standard(s) could be re-used, modified, or extended for use to meet the identified need. The type and frequency of group interactions depend on the type of standards developed, the granularity, coverage, and number of people actively involved. For example, reporting guidelines or recommendations – in narrative or list form – are not trivial but are less demanding than highly structured exchange formats or ontologies. Once the core group roughly shapes the standards effort, additional stakeholders are engaged and the standards effort iterates forward with their multi-pronged input.
Although the ultimate indicator of progress and success is wide adoption or extension of a standard, there is almost always a significant lag between the conduct of the effort and those final outcomes. This conducting phase can be very costly and time-consuming; the absence of core funds supporting the effort generally has a strong impact on the ability to disseminate the results and, in particular, on broader engagement with the user community. Nevertheless, the ability to solicit testing, monitor results, and manage feedback and requests for extensions are useful intermediate milestones for assessing the progress of this phase. But these milestones must be reflected by interest and response of the community outside the initial core group.
Maintaining: implementations, education, sustainability and evolution
Community-based standards need to be maintained and evolve as appropriate. Typically, standards efforts are long-term endeavors and require updating and evolution as the science and technology to which they relate change. Responsibility for keeping efforts updated over the long term varies across groups, from the responsibility of a core group or the primary developers acting directly or appraising work, to committees elected by the members that follow very formalized processes, such as in the Clinical Data Interchange Standards Consortium (CDISC) and Health Level 7 (HL7). And, mandates by government and other key actors greatly influence the evolution and sustainability of particular standards efforts (including CDISC and HL7).
It is important to encourage and accommodate interaction with stakeholders to assure that the evolution of a standards effort is serving community-wide needs. Backward compatibility with original and previous versions, via migrations or conversion modules, is crucial to ensure consistency and continuity. To this end, documentation of standards, comprising both technical specifications and a set of examples, is generally maintained and provided. These can also serve as training material in hands-on sessions for direct use of the standard or for implementation into tools and data resources.
While standards efforts use a number of different business models, sustainability is best bolstered by the wide acceptance and adoption of their products. Once standards become part of the fabric of research, all stakeholders have an incentive to aid in their continuance. And, just as inclusion of diverse stakeholders, including those with commercial interests (e.g., instrument manufacturers or publishers), is important in the formulation of standards efforts, continued engagement of stakeholders throughout the life cycle of a standards effort helps assure their long-term interest. Finally, because standards efforts can have long lifetimes, interested students and young scientists should be encouraged to participate. Succession is an important part of sustainability.
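As a minimal sketch of what a chain of conversion modules might look like, the Python below upgrades a metadata record one version at a time until it reaches the latest version. The version numbers, the field names (`species`, `organism`, `license`) and the function names are all hypothetical and chosen only for illustration; they are not taken from any of the standards discussed in this report.

```python
# Hypothetical metadata record versions; all field names are invented for illustration.

def v1_to_v2(record):
    """Assumed change in version 2: the 'species' field is renamed to 'organism'."""
    upgraded = dict(record)
    upgraded["organism"] = upgraded.pop("species")
    upgraded["version"] = 2
    return upgraded


def v2_to_v3(record):
    """Assumed change in version 3: a 'license' field becomes mandatory."""
    upgraded = dict(record)
    upgraded.setdefault("license", "unspecified")
    upgraded["version"] = 3
    return upgraded


# One conversion module per version step, keyed by the version it upgrades from.
MIGRATIONS = {1: v1_to_v2, 2: v2_to_v3}
LATEST_VERSION = 3


def migrate(record):
    """Apply conversion modules in order until the record is at the latest version."""
    current = dict(record)
    while current["version"] < LATEST_VERSION:
        current = MIGRATIONS[current["version"]](current)
    return current


old_record = {"version": 1, "species": "Homo sapiens"}
new_record = migrate(old_record)
```

Keeping one small converter per version step means that records written against any earlier version can still be read after the standard evolves, and each converter can double as a worked example in the documentation.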
Realizing the potential
Today, the concepts described in a scientific paper are like the tip of an iceberg; underneath, and not widely seen, are details of methods, analyses, and the data. BD2K is set to tip that iceberg on its head and transform the biomedical research enterprise from one that is concept-centric to one that is more data-centric. This will add value to the research investment and ultimately improve health. The social and technical underpinnings of this future are now taking shape. Platform, tool and infrastructure initiatives (e.g., NIH Blueprint for Neuroscience Research, ELIXIR), cross-sector partnerships (e.g., NIH Accelerating Medicines Partnership, IMI eTRIKS and tranSMART Foundation) and the newly funded NIH BD2K centers of excellence – in particular the Center for Expanded Data Annotation and Retrieval (CEDAR) and the biomedical and healthCAre Data Discovery and Indexing Ecosystem (bioCADDIE) – are already promoting the use of community-based standards through shared data, interoperable tools and computational environments. But there is still much to do.
Our in-depth analysis of specific issues that pertain to the entire standards life cycle has provided us with valuable information (see Text Box 2) to shape frameworks for governance, administrative procedures and funding mechanisms that can routinely be used to support and promote a wide range of community-based standards efforts. Such frameworks could then be used to provide catalytic support of particularly opportune community-based standards efforts related to research data that would make an important difference across a broad spectrum of NIH-supported research.
Facilitating the work of community-based standards efforts will add value to research, and that value will be compounded if the facilitation is done in a coherent, coordinated fashion, with standards activities of a particular community being informed about and by the activities ongoing in other communities. As BD2K moves forward, an awareness of the state-of-the-standards landscape through such endeavors will be important to maintain, including via umbrella registries such as BioSharing (now also working as part of the BD2K CEDAR), but also engaging closely with the NIH BD2K centers and other global efforts, such as the pre-competitive pharma-driven Pistoia Alliance, the Global Alliance for Genomics and Health (Global Alliance), and Research Data Alliance (RDA).
TEXT BOX 1
Anatomy of Community Standards
TEXT BOX 2
Elements of Community Standards
Just as there are elements that are key to know about in order to understand a research project (e.g., the hypotheses to be tested, the methods to be used), community-based standards efforts also have such key elements; these include:
Response of the Clinical and Translational Science Ontology Group to Frameworks for Community-based Standards Workshop Report
By William Hogan, Barry Smith, Mathias Brochhausen, Sivaram Arabandi, Jihad Obeid, and Jie Zheng on behalf of the Clinical and Translational Science Ontology Group
The working group report is an excellent overview of the current status and state-of-the-art in community-based standards efforts. However, it makes few concrete recommendations for progress and widespread creation and adoption of these standards.
Thus, the response of the Clinical and Translational Science Ontology Group (CTSOG) is focused on specific actions that NIH and other stakeholders should take to jumpstart progress. As the leadership of CTSOG has highlighted in the past, urgent action is necessary to create and adopt ontologies for the acceleration of translational science.
To emphasize a key finding of the workshop report, community is a requirement for the creation, adoption, and implementation of a standard. To the extent that a standard is defined as a specification that is actually followed by many groups and organizations, a successful community is definitional of a standard, i.e., a necessary definitional criterion. We see as examples of successful communities the Gene Ontology Consortium and the VIVO community. The latter is a good example of widespread adoption of an ontology in one or more software systems.
The Clinical and Translational Science Ontology Group or CTSOG brings together scientists, informaticians, ontologists, and other key stakeholders to drive the creation and adoption of community-based ontology standards for translational science. The CTSOG began in 2012 and has held three face-to-face meetings since that time. It has addressed the need for standard ontologies in CTSA evaluation, research networking, imaging informatics, demographics, informed consent, biobanking, omics data, and clinical data warehouses. Thus, the CTSOG has significant collective experience in convening communities to generate, adopt, and maintain data standards.
The CTSOG respectfully submits the following recommendations:
1. Disseminate best practices in forming, convening, governing, and sustaining successful and active communities that create and maintain standards.
Although the Report acknowledges the significant but absolutely necessary overhead for successful community based standards efforts—use cases, documentation, communication and dissemination planning, execution of communication and dissemination plans, governance, and so on—there are few best practices published or otherwise made publicly available. The sociology of this process is certainly as important to translational science as are the sciences of informatics and ontology, but much less is known and published. Therefore, the publishing and sharing of this knowledge should be made a priority and a requirement for NIH support of community-based standard development.
The CTSOG could serve as a forum for discussing, developing, and disseminating best practices with respect to standards that support translation and translational science. CTSOG holds an annual face-to-face meeting that could easily incorporate a standing agenda item related to the science of community in standard ontology development.
2. Fund the administrative overhead of community organization and the science of discovering new best practices in community governance and sustenance.
Communities should be able to apply for support to cover the overhead of community formation, convening the community, community governance, administration, and so on. This support would be for face-to-face meetings, stakeholder engagement, infrastructure, and communication and dissemination.
All standards developed with this support should be free and open. For example, the Open Biological and Biomedical Ontologies (OBO) Foundry requires that every ontology in the Foundry be freely available and state explicitly under which open license it is released. It further recommends the use of the Creative Commons CC-BY license for ontologies in the Foundry. The CC-BY license allows anyone to copy and redistribute the ontology in any medium or format and remix, transform, and build upon the ontology for any purpose, even commercially under the following condition of attribution: one must give appropriate credit, provide a link to the license, and indicate if changes were made.
3. Amend NIH Requests for Applications and Program Announcements to mandate that the data sharing plan describe how the proposed science will make use of community-based data standards
NIH should increasingly require that the data sharing plan describe how it will make use of community-based standards in the data generated by the proposed research.
4. Encourage CTSA recipients to amend promotion and tenure criteria to give credit for meaningful participation in the development and maintenance of community-based standards
The NIH should encourage CTSA recipients to create and evaluate innovative approaches to giving academic promotion and tenure credit for contributions to community-based standards. This recommendation and the one that follows require study of the appropriate metrics and mechanisms for assigning credit. We discuss it further with the next recommendation.
5. Promote the peer review and evaluation of community-based standards
Community-based standards should undergo peer review. A key issue is who should conduct the peer review and how. The OBO Foundry is a model, with editors who review ontologies submitted for inclusion in the Foundry. The Foundry confers special status on ontologies that meet Foundry quality criteria. In this sense, the OBO Foundry is a community of communities. Each community is informed by, and informs, the work of other communities. This work too is very important and under resourced at the moment.
The CTSOG could also play a role in this peer review. We note that, in the past, having groups present their ontologies to the broader community for discussion and critical review in a face-to-face forum has helped improve them. We also note that receiving promotion and tenure credit for contributions to community-based standards will help. However, metrics to understand successful standards and individuals’ contributions to them are not in widespread use or well understood. The CTSOG also could play a role in developing and tracking the success and appropriateness of various metrics.
The NIH could also amend its NIH biosketch to include a section whereby an individual lists and/or summarizes his/her contributions to community-based standards, which could be recognized under the contribution to science section in the newly released NIH biosketch format (NOT-OD-15-032).
6. Incorporate information about and index community-based standards in data, software, and other resource discovery indexes
Catalogs of datasets and software should also incorporate and index information about community-based standards. Key information about a standard for a catalog includes a contact, license under which the standard is released, last update date, and links to the standard’s home page, documentation, community wiki page, repository where the standard is maintained, and issue tracker.
However, we note that the lesson of NCBO BioPortal is that a comprehensive catalog of everything there is can be quite confusing and unhelpful to those looking for actively maintained standards with a thriving community behind them. Therefore it will be important to incorporate only those standards that have undergone peer review, or at least incorporate peer review information with each standard indexed in the catalog.
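To make this recommendation concrete, the sketch below shows one possible shape for a catalog entry, using the key information listed above, together with the two indexing options just described. It is an illustration under stated assumptions rather than a proposed schema: the class name, the field names, the `peer_review` field and the example entries (with their `example.org` links) are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class StandardRecord:
    """Hypothetical catalog entry for one community-based standard."""
    name: str
    contact: str
    license: str
    last_updated: date
    home_page: str
    documentation: str
    community_wiki: str
    repository: str
    issue_tracker: str
    # Assumed field: a short summary of the peer review outcome, or None if not reviewed.
    peer_review: Optional[str] = None


def reviewed_only(records: List[StandardRecord]) -> List[StandardRecord]:
    """Option 1: index only the standards that have undergone peer review."""
    return [r for r in records if r.peer_review is not None]


def with_review_status(records: List[StandardRecord]) -> List[Tuple[str, str]]:
    """Option 2: index every standard, but show its peer review information."""
    return [(r.name, r.peer_review or "not peer reviewed") for r in records]


# Invented example entries; names and URLs are placeholders.
catalog = [
    StandardRecord(
        name="Example Ontology A",
        contact="maintainers@example.org",
        license="CC-BY",
        last_updated=date(2015, 1, 15),
        home_page="https://example.org/a",
        documentation="https://example.org/a/docs",
        community_wiki="https://example.org/a/wiki",
        repository="https://example.org/a/repo",
        issue_tracker="https://example.org/a/issues",
        peer_review="accepted after editorial review",
    ),
    StandardRecord(
        name="Example Format B",
        contact="owner@example.org",
        license="CC-BY",
        last_updated=date(2012, 6, 1),
        home_page="https://example.org/b",
        documentation="https://example.org/b/docs",
        community_wiki="https://example.org/b/wiki",
        repository="https://example.org/b/repo",
        issue_tracker="https://example.org/b/issues",
    ),
]

reviewed_names = [r.name for r in reviewed_only(catalog)]
status_listing = with_review_status(catalog)
```

The second option keeps the catalog comprehensive while still letting users tell reviewed standards apart from unreviewed ones at a glance.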
7. Invest in communities commensurate with their success
The ability to acquire increasingly large financial resources for community organization and standards development and maintenance will incentivize communities to adopt best practices at each stage of development. Thus, communities forming for the first time to consider a new standard would receive modest resources to convene, whereas well-established consortia that meet numerous criteria for success would receive significant resources to continue maintaining standards of high value to the scientific community.
This is an important paper highlighting a number of issues.
I have a couple of comments related to Text box 2.
A note on tools for making, developing and evolving standards could be added.
An essential aspect of standards is training: first, training about their importance, and then about how to apply standards. This should be included in relevant educational curricula.
To be useful, the standards have to be applied. Plans for making standards approved by the community are important. It is not self evident that even a good standard will be utilized, at least to its full potential.