This page introduces the guiding principles and core concepts of the MultiCellDS project.
Motivation for multicellular data standards
If computational modeling is to reach its fullest potential in (multicellular) biology and medicine, we must make advances in extracting biophysical parameters from experimental and clinical data, using these measurements to seed computational models, analyzing model outputs to make predictions, and quantitatively comparing model predictions to biomedical data. As experimental measurements come with higher throughput and become more quantitative, it will be necessary to accomplish these tasks consistently, efficiently, and automatically.
Moreover, repositories of shared experimental, clinical, and simulation data, together with a suite of standardized data processing tools, will be needed to drive the next generation of predictive computational modeling.
Likewise, as more and more image-based multicellular data are created by high-throughput experiments and clinical studies, it is essential that we find a common language to communicate those data and open them up to novel data analyses. Those data must be connected to contextual information (the experimental conditions, the cell lines used, who performed the experiments, and with what software) to allow research transparency and reproducibility.
These issues can be addressed by developing standards for multicellular data, shared preprocessing and postprocessing tools that support these standards, and repositories of shared data. This is our motivation for the MultiCellDS Project: an effort to create a multicellular data standard, along with tools and a repository.
The MultiCellDS Project aspires to promote data sharing in computational and experimental biology and medicine, particularly cancer. We aim to foster comparison, refinement, and recombination of models that can help us better understand biology and predict disease progression.
To achieve these goals, MultiCellDS operates under these guiding principles:
- The data standard should be based on physically-motivated data elements that are not specific to any computational or mathematical model.
- The data standard should focus on representing fundamental biological components: cells, blood and lymphatic vessels, extracellular matrix, and key molecular substrates.
- The data standard should leave model representation to other standards efforts.
- The data elements should be documented with human-readable definitions, references, change history, and suggested units and measurement methods. (This is still in progress.)
- The data standard should encourage development of many independent models that can read and initialize simulations and analyses from the same spatial and phenotypic data.
- The data standard should encourage recombination of independent model outputs into larger ensemble models that can account for modeling and data uncertainty.
- The data standard should ease creation of shared analytical, postprocessing, and visualization tools that can read and write the shared data format.
- The project's root goal is data sharing; it will create an ontology, but not as its primary endpoint.
- A pragmatic, incremental approach will speed progress towards a working standard.
- The standard should start with a small core of clearly important data elements, and expand to include data elements as needs emerge.
- The data standard should allow encoding in markup languages like XML, in relational databases, and in compressed hierarchical formats like HDF.
MultiCellDS vs. MultiCellXML
MultiCellDS grew from the original MultiCellXML project. MultiCellDS is the data specification and includes:
- allowable data elements
- relationships among data elements
- data element attributes and quality measures
- appropriate metadata
- overall structuring of elements and metadata.
MultiCellDS data can be stored in XML files (MultiCellXML) or in a repository (MultiCellDB). In the future, MultiCellDS data will also be saved in HDF format (MultiCellHDF) to allow better data compression.
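To make the split between the specification and its storage format concrete, here is a minimal sketch of encoding one record as XML and reading it back. It does not reproduce the actual MultiCellXML schema: the element names (`record`, `metadata`, `data_elements`, `oxygen_partial_pressure`, `cell_cycle_duration`), the `units` attribute, and the numeric values are all assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET


def encode_record(name: str, oxygen_mmHg: float, cycle_duration_h: float) -> str:
    """Encode one hypothetical phenotype record as an XML string."""
    root = ET.Element("record")  # hypothetical root element, not the real schema
    # Metadata is kept apart from the data elements themselves.
    metadata = ET.SubElement(root, "metadata")
    ET.SubElement(metadata, "cell_line_name").text = name
    # Each data element carries its units as an attribute (an assumption).
    data = ET.SubElement(root, "data_elements")
    oxygen = ET.SubElement(data, "oxygen_partial_pressure", units="mmHg")
    oxygen.text = repr(oxygen_mmHg)
    cycle = ET.SubElement(data, "cell_cycle_duration", units="hour")
    cycle.text = repr(cycle_duration_h)
    return ET.tostring(root, encoding="unicode")


def decode_record(xml_text: str) -> dict:
    """Parse the XML back into plain Python values, keeping the units."""
    root = ET.fromstring(xml_text)
    values = {}
    for element in root.find("data_elements"):
        values[element.tag] = (float(element.text), element.get("units"))
    return {"name": root.findtext("metadata/cell_line_name"), "values": values}


# Placeholder values for illustration only; they are not measurements.
xml_text = encode_record("example_line", 38.0, 18.5)
record = decode_record(xml_text)
print(xml_text)
print(record)
```

Because the data elements are defined independently of the encoding, the same record could in principle be written to a database row or an HDF dataset without changing its meaning.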
Digital cell lines
A key problem facing computational biologists is a lack of standardized recording of cell phenotypic properties. Moreover, we do not have standardized model cell systems for computational experiments. MultiCellDS aims to solve this with digital cell lines: the digital analogue of an experimental cell line.
A digital cell line is an extensible, standardized representation of a cell line. It includes cell phenotypic parameters (e.g., cell cycle and volume data elements), along with information on the microenvironmental context (e.g., oxygenation). These data elements may be recorded in several microenvironmental conditions and grouped in a digital cell line.
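The grouping described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names (`Microenvironment`, `PhenotypeDataset`, `DigitalCellLine`), the field names, and the lookup helper `closest_to_oxygen` are hypothetical and are not part of the MultiCellDS specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Microenvironment:
    """Conditions under which phenotype measurements were taken (hypothetical fields)."""
    oxygen_mmHg: float
    description: str = ""


@dataclass(frozen=True)
class PhenotypeDataset:
    """Phenotypic parameters recorded in one microenvironmental condition."""
    microenvironment: Microenvironment
    cell_cycle_duration_h: float
    cell_volume_um3: float


@dataclass
class DigitalCellLine:
    """Groups phenotype datasets recorded across several conditions."""
    name: str
    datasets: List[PhenotypeDataset] = field(default_factory=list)

    def add(self, dataset: PhenotypeDataset) -> None:
        self.datasets.append(dataset)

    def closest_to_oxygen(self, oxygen_mmHg: float) -> PhenotypeDataset:
        """Return the dataset recorded nearest to the requested oxygenation."""
        if not self.datasets:
            raise ValueError("digital cell line has no phenotype datasets")
        return min(
            self.datasets,
            key=lambda d: abs(d.microenvironment.oxygen_mmHg - oxygen_mmHg),
        )


# Placeholder numbers for illustration only; they are not measurements.
line = DigitalCellLine(name="example_line")
line.add(PhenotypeDataset(Microenvironment(38.0, "normoxic"), 18.0, 2500.0))
line.add(PhenotypeDataset(Microenvironment(8.0, "hypoxic"), 30.0, 2400.0))

selected = line.closest_to_oxygen(10.0)
print(selected.microenvironment.description, selected.cell_cycle_duration_h)
```

Note that the structure only stores and retrieves measurements. Deciding how a simulated cell should behave between recorded conditions remains the modeler's job.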
This data model reflects the physical biology origins of MultiCellDS, and we intend digital cell lines to broadly sample the microenvironmental space to include different combinations of hypoxia/oxygenation, matrix stiffness, signaling factors (e.g., those secreted by co-cultured cells), and therapeutic compounds. In the future, these phenotype datasets will be combined with molecular descriptions of the cell's internal molecular state and embedded as SBML, BioPAX, or other well-established standards for subcellular and systems biology. The net result: systematic, multiscale characterizations of cell behavior, external environment, and internal state. We envision broad multiscale modeling and data mining possibilities with such richly characterized digital cell lines.
A digital cell line is a data model and not a computational model. It gives an orderly, standardized recording of key cell phenotypic and biophysical characteristics, and leaves model building and interpretation to modelers. MultiCellDS aims to provide a curated library of digital cell lines that are constantly improved by community-contributed, peer-reviewed measurements. We hope to reduce unnecessary duplication of experimental work, increase data sharing, and free computational modelers to focus on building, calibrating, and improving their models.
A digital snapshot records the current state of an experiment or a simulation. It includes key metadata (e.g., user information, software information, experimental setup, citation information), a list of digital cell lines involved, a list of all cells and their current phenotypic state, and the current state of the microenvironment. Other data aggregations (e.g., bundling a time course of digital snapshots) are also being developed.
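As a rough sketch of how the parts of a snapshot fit together, the following uses hypothetical names throughout (`CellState`, `DigitalSnapshot`, `time_course`, and every field). The `time_h` stamp is an added assumption so that snapshots can be ordered into a time course; none of this is taken from the MultiCellDS schema.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CellState:
    """Current phenotypic state of one cell (hypothetical fields)."""
    cell_id: int
    cell_line: str  # name of the digital cell line this cell belongs to
    position_um: Tuple[float, float, float]
    volume_um3: float


@dataclass
class DigitalSnapshot:
    """State of an experiment or simulation at one moment."""
    time_h: float  # assumption: a time stamp used to order snapshots
    metadata: Dict[str, str]  # e.g. user, software, setup, citation
    cell_lines: List[str]  # names of the digital cell lines involved
    cells: List[CellState] = field(default_factory=list)
    microenvironment: Dict[str, float] = field(default_factory=dict)

    def undeclared_cell_lines(self) -> List[str]:
        """Cell lines used by cells but missing from the declared list."""
        used = {cell.cell_line for cell in self.cells}
        return sorted(used - set(self.cell_lines))

    def counts_by_cell_line(self) -> Dict[str, int]:
        """Number of cells per digital cell line."""
        return dict(Counter(cell.cell_line for cell in self.cells))


def time_course(snapshots: List[DigitalSnapshot]) -> List[DigitalSnapshot]:
    """Bundle snapshots into a time-ordered sequence."""
    return sorted(snapshots, key=lambda snapshot: snapshot.time_h)


# Placeholder values for illustration only; they are not measurements.
meta = {"user": "example_user", "software": "example_simulator"}
later = DigitalSnapshot(
    time_h=24.0,
    metadata=meta,
    cell_lines=["line_A"],
    cells=[
        CellState(0, "line_A", (0.0, 0.0, 0.0), 2500.0),
        CellState(1, "line_A", (15.0, 0.0, 0.0), 2450.0),
        CellState(2, "line_B", (0.0, 15.0, 0.0), 1800.0),
    ],
    microenvironment={"oxygen_mmHg": 38.0},
)
earlier = DigitalSnapshot(time_h=0.0, metadata=meta, cell_lines=["line_A"])

ordered = time_course([later, earlier])
print([snapshot.time_h for snapshot in ordered])
print(later.counts_by_cell_line(), later.undeclared_cell_lines())
```

The `undeclared_cell_lines` check illustrates one kind of consistency test a shared tool could run: every cell in the snapshot should refer to a digital cell line that the snapshot lists.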
Comparison to related standards
There are several similar (but non-overlapping) standards to consider.
SBML (Systems Biology Markup Language) and BioPAX (Biological Pathways Exchange) describe biology at subcellular scales (with a focus on molecular biology and signaling pathways), whereas MultiCellDS focuses primarily on describing the (biophysical) phenotype of multicellular systems and the microenvironment. In short, MultiCellDS focuses on describing biology from the scale of a single cell upward, while SBML and BioPAX focus on detailed descriptions within a single cell. In future developments of MultiCellDS, we plan to incorporate subcellular descriptors that integrate with (and can embed) SBML, BioPAX, and related subcellular data standards.
The CBO (the Cell Behavior Ontology) is a standard for specifying computational biological models. This is complementary to MultiCellDS, which records simulation and experimental data and parameters, but leaves questions of modeling (and description of modeling) as out of scope. Future instances of MultiCellDS may embed CBO in simulation snapshots to describe the models used to generate the results.
Similarly, CellML (Cell Markup Language) is focused primarily on specifying and exchanging mathematical models, rather than model data. It has an excellent repository of mathematical models at models.cellml.org.