Table of Contents
- Considering community-based standards
- An evolving landscape with a complex life cycle
- Formulating: use cases, scope and prioritization
- Conducting: iterations, feedback and requests
- Maintaining: implementations, education, sustainability and evolution
- Realizing the potential
Through a number of activities and events, the National Institutes of Health (NIH) Big Data to Knowledge initiative (BD2K) will make the biomedical research enterprise more data-centric, with a vision for findable, accessible, citable, interoperable and reusable datasets that are linked with other research products to fuel new investigations and discoveries.
Here we report on the importance of data and metadata-related standards in this ecosystem, and the enabling role of community-based standards efforts. Specifically, we analyze issues that pertain to their entire life cycle, from formulation to adoption and maintenance, and introduce initial opportunities for BD2K to promote and encourage these crucial endeavors.
Biomedical research generates large amounts of complex and diverse data – big data. While the vast majority of these data remain in the labs that produced them, some areas of biomedical research, such as genomics, have a tradition of making data broadly available. The process is often guided by established policies, such as that for NIH Genome Wide Association Studies. In these areas, the availability of data – and seizing opportunities that availability affords (e.g., 1) – has had a major impact on the rate, quantity and quality of scientific progress and its impact on society. In recognition of this, individuals and organizations around the world are rallying to further enhance this paradigm (e.g., 2, 3).
Today, high returns on research investment, like those seen in genomics, are increasingly expected by funding organizations and those who benefit from research. But, the major public products of most of today’s biomedical research enterprise are still limited to concepts, such as hypotheses, interpretations and conclusions, described in scientific papers (e.g., 4). The data underlying these concepts are rarely available, so today’s enterprise is concept-centric.
New technical capabilities and scientific opportunities in biomedical research are making data sharing both easier and more fruitful. Changing perspectives and expectations of society have encouraged science policy initiatives to increase access to data, reproducibility of research results, and return on investment in research. These converging trends will likely translate into a dramatic increase in the broad availability of biomedical research data. But even when data are available, they are often not reusable or reproducible by independent investigators due to incomplete annotation of sample descriptions and data processing, or the use of ad hoc or proprietary formats and terminologies. Available data need to be rendered broadly usable – able to work with other data, tools and data resources – through the use of data-related standards. To make the best use of data-related standards, investigators need to know what standards are available for use. And, in those cases where attention is required to catalyze a standards-related effort with potentially significant impact, support should be available. Such awareness and support would represent important steps in making broadly available data broadly usable and, in turn, making tomorrow’s biomedical research enterprise more data-centric.
Considering community-based standards
The BD2K initiative promises to play a major role in transforming the nature of the biomedical enterprise through three major thrusts. First, it will enable scientific innovation by advancing the science and technology of biomedical big data through the support of centers of excellence and research project grants, as well as by expediting the use of large-scale computing in biomedical research. Second, it will enhance and develop the workforce in biomedical big data through a variety of innovative training initiatives aimed at levels from students to senior investigators. Third, BD2K will facilitate the broad use of biomedical data by working to change NIH policies, practices and culture as they relate to data management, sharing and use. This will include careful consideration of community-based standards efforts.
Therefore, gaining a clearer understanding of such dynamic community efforts, and of how NIH might best relate to them, is pivotal and was the scope of this BD2K workshop on Frameworks for Community-based Standards Efforts. Note that the focus of the workshop was limited to standards to represent and share data/results files and the associated contextual metadata (e.g., minimum reporting requirements, common metadata, terminologies – controlled vocabulary, taxonomy, thesaurus, ontology – file formats and conceptual models). Clearly many more types of community-based standards efforts exist (see Text Box 1), and there is no standard way to classify and organize them.
An evolving landscape with a complex life cycle
Mapping the landscape of community-based data and metadata standards is not a simple task. Most efforts are long-term endeavors with many different players and types of efforts. Broadly, these standards initiatives work to allow data sets to be harmonized with regard to structure, formatting and annotation so as to open their content to transparent interpretation, an understanding of how and why the conclusions of a scientific paper were drawn from the data collected. They also facilitate reuse of data, as well as the integrative analysis and comparison to other data sets. Finally, such efforts also interlink data with other research products, such as scholarly material, algorithms and software.
Community-based standards endeavors are well-served by diverse perspectives, as such diversity exposes the spectrum of real-world needs to address. When standards efforts are not only developed for a research community, but also by that research community – with the engagement of a wide range of diverse stakeholders – the likelihood that the standards will be successfully implemented is greatly increased (e.g., 5, 6, 7, 8).
An inevitable consequence of having multiple community-based efforts is that they are fragmented, sometimes duplicating efforts of others and sometimes leaving gaps in domains that should be covered. For example, currently, descriptions of experiments in which a sample has been subject to several kinds of assays, using a variety of technologies, are particularly challenging to share as coherent units of research because of the variety of standards for reporting data and experimental metadata with which the parts must be formally represented. Regardless of the context of a particular standards effort, it is important that standards are interoperable to allow research objects to work with one another, in order to provide consistency across data, tools, codes and resources.
For this reason, awareness of which groups are doing what is vital to a coordinated approach; and, since the activities of any group change over time, such awareness must be continually updated. For data and metadata, BioSharing provides an at-a-glance view of almost 600 community-developed content standards in the broad life sciences, linking them to database resources to monitor their use and working to assign criteria for adoption (9). Defining the maturity of a standard, for example, is critical to its adoption, so that it can be channeled to the appropriate stakeholder community for use. The ownership of open standards is problematic in broad, grass-roots collaborations, but new business models, including encouraging the involvement of commercial entities, would mitigate these problems. And the unfunded nature of many standards efforts requires methods and processes to assign formal rewards and provide incentives for all contributors who operate in a voluntary manner.
The challenges, however, extend well beyond these points. Analysis of the community-based standards life cycle has revealed that different issues pertain to each phase (i.e., formulating, conducting and maintaining standards efforts); communities’ social and technical approaches to common problems are also quite diverse. An understanding of such a complex ecosystem of players and types of efforts, and their complex operational life cycle, is pivotal if we are to realize the potential that community-based standards can offer to the biomedical research enterprise of the future.
Formulating: use cases, scope and prioritization
Community-based standards efforts typically start with the identification of a need for standards, often in recognition that not meeting that need has significant opportunity costs. Use cases, sometimes presented in the form of stakeholder stories, are valuable for defining the breadth and depth of the requirements – making clear what purpose the standards effort is meant to address. For example, a public health research project might use the standards that are required by a government agency for similar data (such as those required by the USA Centers for Disease Control in public health reporting) to allow for interoperability with and comparison to a much larger data set.
The need for a particular standards effort, such as a common terminology for human diseases, is typically driven by the needs of the community of practice being made known through any of a number of different channels. For example, direct observation of problems in the research community is employed by the Human Proteome Organization’s Proteomics Standards Initiative (HUPO-PSI), the Bioinformatics Resource Centers (BRCs), and PhenX. Organized discussions with polling of the community are the approaches taken by the discussion forum of the Metabolomics Society and the Genomic Standards Consortium (GSC). And, the use of a standing body to which requests for various standards efforts are made is exemplified by the Data Documentation Initiative (DDI) Alliance, operating in the social sciences, which has its Technical Committee field requests for new standards as well as modifications to existing standards.
Use cases are also valuable for delimiting the scope and identifying any related standards space that is (or isn’t) part of the effort at hand. To prioritize areas of focus, the Neuroscience Information Framework (NIF) uses indicators of impact such as the likelihood of a wide adoption, possible enhancement or leverage of other standards efforts, and the increase in efficiency or effectiveness for the community. The ISA Commons has prioritized its efforts on different experimental types, focusing on capturing the experimental context and then creating a reference system to link to other data-specific representation formats.
In all cases, however, when a need becomes apparent, it is best to first determine whether there are possible solutions that already exist, including those that could be adopted in whole or in part, modified or extended. For example, the ISA-Tab-Nano (10) by the Nanotechnology Working Group (Nano WG) of the National Cancer Informatics Program is a direct extension of the ISA-Tab, one that is now a formal standard of the American Society for Testing and Materials (ASTM). The COordination Of Standards In MetabOlomicS (COSMOS) also couples the reuse of the ISA formats (in its tabular and semantic web versions) with the repurposing of relevant formats by the HUPO-PSI. The additional social engineering work needed to facilitate the engagement between two or more efforts, and to ensure all parties will get the relevant credit, is a time-consuming factor but ultimately contributes to the wider adoption and interoperability among standards.
This formulating phase crucially depends upon identifying, assembling and engaging with the right people, with iterations punctuated by consultation with experts and the broader community. Some standards efforts, for example the Computational Modeling in Biology Network (COMBINE), which coordinates standards in systems biology, allow anyone to self-identify and volunteer to help, while others, such as the NCI Nano WG, actively solicit experts for particular roles.
Conducting: iterations, feedback and requests
Once the determination to move forward with a particular standards effort has been made, a core group of individuals with necessary expertise and perspectives convene numerous times to get the effort underway. Although some groups have successfully started the discussion with virtual interactions, face-to-face meetings seem crucial to galvanize a new group, enable participants to evaluate their commitment, design the initial workplan, harmonize different perspectives and explore available options such as whether existing standard(s) could be re-used, modified, or extended for use to meet the identified need. The type and frequency of group interactions depend on the type of standards developed, the granularity, coverage, and number of people actively involved. For example, reporting guidelines or recommendations – in narrative or list form – are not trivial but less demanding than highly structured exchange formats or ontologies. Once the core group roughly shapes the standards effort, additional stakeholders are engaged and the standards effort iterates forward with their multi-pronged input.
Although the ultimate indicator of progress and success is wide adoption or extension of a standard, there is almost always a significant lag between the conduct of the effort and those final outcomes. This conducting phase can be very costly and time-consuming; the absence of core funds supporting the effort strongly affects the ability to disseminate the results in general and to engage more broadly with the user community in particular. Nevertheless, soliciting testing, monitoring results, and managing feedback and requests for extensions are useful intermediate milestones for assessing the progress of this phase. But these milestones must be reflected by interest and response of the community outside the initial core group.
Maintaining: implementations, education, sustainability and evolution
Community-based standards need to be maintained and evolve as appropriate. Typically, standards efforts are long-term endeavors and require updating and evolution as the science and technology to which they relate change. Responsibility for keeping efforts updated over the long term varies across groups, from the responsibility of a core group or the primary developers acting directly or appraising work, to committees elected by the members that follow very formalized processes, such as in Clinical Data Interchange Standards Consortium (CDISC) and Health Level 7 (HL7). And, mandates by government and other key actors influence greatly the evolution and sustainability of particular standards efforts (including CDISC and HL7).
It is important to encourage and accommodate interaction with stakeholders to assure that the evolution of a standards effort is serving community-wide needs. Backward compatibility with original and previous versions, via migrations or conversion modules, is crucial to ensure consistency and continuity. To this end, documentation of standards, comprising both technical specifications and a set of examples, is generally maintained and provided. This documentation can also serve as training material in hands-on sessions for direct use of the standard or for implementation into tools and data resources.
While standards efforts use a number of different business models, sustainability is best bolstered by the wide acceptance and adoption of their products. Once standards become part of the fabric of research, all stakeholders have an incentive to aid in their continuance. And, just as inclusion of diverse stakeholders, including those with commercial interests (e.g., instrument manufacturers or publishers), is important in the formulation of standards efforts, continued engagement of stakeholders throughout the life cycle of a standards effort helps assure their long-term interest. Finally, because standards efforts can have long lifetimes, interested students and young scientists should be encouraged to participate. Succession is an important part of sustainability.
Realizing the potential
Today, the concepts described in a scientific paper are like the tip of an iceberg; underneath, and not widely seen, are details of methods, analyses, and the data. BD2K is set to tip that iceberg on its head and transform the biomedical research enterprise from one that is concept-centric to one that is more data-centric. This will add value to the research investment and ultimately improve health. The social and technical underpinnings of this future are now taking shape. Platform, tool and infrastructure initiatives (e.g., NIH Blueprint for Neuroscience Research, ELIXIR), cross-sector partnerships (e.g., NIH Accelerating Medicines Partnership, IMI eTRIKS and tranSMART Foundation) and the newly funded NIH BD2K centers of excellence – in particular the Center for Expanded Data Annotation and Retrieval (CEDAR) and the biomedical and healthCAre Data Discovery and Indexing Ecosystem (bioCADDIE) – are already promoting the use of community-based standards through shared data, interoperable tools and computational environments. But there is still much to do.
Our in-depth analysis of specific issues that pertain to the entire standards life cycle has provided us with valuable information (see Text Box 2) to shape frameworks for governance, administrative procedures and funding mechanisms that can routinely be used to support and promote a wide range of community-based standards efforts. Such frameworks could then be used to provide catalytic support of particularly opportune community-based standards efforts related to research data that would make an important difference across a broad spectrum of NIH-supported research.
Facilitating the work of community-based standards efforts will add value to research, and that value will be compounded if the facilitation is done in a coherent, coordinated fashion, with standards activities of a particular community being informed about and by the activities ongoing in other communities. As BD2K moves forward, an awareness of the state-of-the-standards landscape through such endeavors will be important to maintain, including via umbrella registries such as BioSharing (now also working as part of the BD2K CEDAR), but also engaging closely with the NIH BD2K centers and other global efforts, such as the pre-competitive pharma-driven Pistoia Alliance, the Global Alliance for Genomics and Health (Global Alliance), and Research Data Alliance (RDA).
TEXT BOX 1
Anatomy of Community Standards
TEXT BOX 2
Elements of Community Standards
Just as there are elements that are key to know about in order to understand a research project (e.g., the hypotheses to be tested, the methods to be used), community-based standards efforts also have such key elements; these include: