Over the last few decades the medical sciences, along with clinical and translational research, have been transformed into data-intensive scientific disciplines. A variety of data repositories exists in which diverse data sets can be found:
- IDEM – Ethics in medicine
- Gene-disease relationships – Information about gene-disease relationships
- Human Metabolome Database – Biomarkers and clinical chemistry
Data collecting and structuring
In medical and clinical research it is useful to work with common data elements (CDEs). A CDE is a precisely defined question paired with a specified format or set of allowed values, so that the answers can be interpreted unambiguously by humans and machines. CDEs are used by online collections, repositories and other online resources to integrate clinical data from diverse studies and registries. The US National Library of Medicine provides a CDE portal where common data elements can be searched and explored.
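As a rough illustration of the idea described above, a CDE can be modelled as a question with a fixed set of permissible values, against which collected records are checked. The sketch below is only an assumption about how such a check might look: the element `smoking_status`, its allowed values, and the helper `find_invalid` are hypothetical and are not taken from any CDE portal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommonDataElement:
    """A question with a fixed set of permissible answers."""
    name: str
    question: str
    permissible_values: frozenset

    def validate(self, value) -> bool:
        # A value is valid only if it is one of the permissible values.
        return value in self.permissible_values


# Hypothetical CDE, invented for this example.
SMOKING_STATUS = CommonDataElement(
    name="smoking_status",
    question="What is the participant's current smoking status?",
    permissible_values=frozenset({"never", "former", "current", "unknown"}),
)


def find_invalid(records, cde):
    """Return (row index, offending value) for each record violating the CDE."""
    return [
        (i, record.get(cde.name))
        for i, record in enumerate(records)
        if not cde.validate(record.get(cde.name))
    ]


records = [
    {"participant": "P01", "smoking_status": "never"},
    {"participant": "P02", "smoking_status": "Smoker"},  # not a permissible value
    {"participant": "P03"},                              # value missing
]

print(find_invalid(records, SMOKING_STATUS))
# -> [(1, 'Smoker'), (2, None)]
```

Checking values at the point of collection, rather than at the time of integration, tends to make it easier to combine data from different studies later on.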
The use of ontologies for annotating and standardizing results is also recommended, since ontologies facilitate the querying and interpretation of data. Applying them requires thorough knowledge of the specific domain in which the data are generated. The Ontobrowser allows you to browse through available ontologies in the biomedical sciences and match your results to a given ontology.
Depending on the type of generated data, different approaches to structuring and making data available may be needed. If rapid access to the data sets is desired, a file storage approach might be the best solution. If considerable amounts of research data are being generated, an object storage system which can grow in scale and capacity and has higher throughput performance should be considered.
Data formats and Metadata
Data formats and the intended data usage in the medical sciences are highly diverse. It is thus important to ensure that the choice of data formats and metadata suits the data delivery needs of a project. The chosen formats and metadata should support a variety of usages, ranging from simple queries over an API to downloading data sets via a browser or over HTTP. The ULB provides a list of accepted and preferred data formats which can be used to ensure the readability of data in the future. The list is updated periodically to ensure that the advice given remains accurate.
Additional useful links and advice for choosing the right data formats for long-term preservation can be found via the Library of Congress resource.
Medical studies and projects routinely acquire data from human subjects; all legal, data protection and ethical matters must therefore be addressed and resolved. Studies should seek the approval of the ethics commission of the MLU.
A number of steps should be considered when preparing clinical data for reuse by others. These can be summarized as follows:
- Seek early approval not only from surveyed participants, but from the institution at which the research is conducted and from the corresponding sponsors.
- The data request process should be transparent. This includes the preparation of data-application forms, consent forms and a selection process which ensures that requests can be reviewed thoroughly by data creators or custodians. Only bona fide collaborators should be able to obtain access to the data.
- The anonymization of the data sets should take into account the agreements made in the consent forms and the infrastructure available for transferring the data. It should be overseen by groups or individuals with good data management skills and experience with quality control processes.
- Data should be made available via secure-data-transfer methods.
- Supporting documentation should be made available with the data sets.
- A data use agreement and a license should be used to ensure that 1) no attempts are made to re-identify or contact trial participants, 2) data creators are acknowledged and cited correctly, and 3) the approved conditions under which the data set can be used are clearly outlined for other users.
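The anonymization step in the list above could begin along the lines of the following minimal sketch. It only removes direct identifiers and replaces the participant ID with a keyed hash, which is pseudonymization rather than full anonymization: quasi-identifiers such as dates of birth or postcodes would still need separate treatment in line with the consent forms. The field names, the `DIRECT_IDENTIFIERS` set and the functions shown are assumptions made for illustration, not a prescribed procedure.

```python
import hashlib
import hmac

# Hypothetical field names; which fields count as direct identifiers
# depends on the study and on what the consent forms allow.
DIRECT_IDENTIFIERS = {"name", "address", "phone", "email"}


def pseudonymize_id(participant_id: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym with HMAC-SHA256.

    Without the secret key, the pseudonym cannot be recomputed from a
    known participant ID, unlike a plain unsalted hash.
    """
    digest = hmac.new(secret_key, participant_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def deidentify(record: dict, secret_key: bytes) -> dict:
    """Drop direct identifiers and swap the participant ID for a pseudonym."""
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIERS and key != "participant_id"
    }
    cleaned["pseudo_id"] = pseudonymize_id(record["participant_id"], secret_key)
    return cleaned


# The key must be stored separately from the shared data set.
key = b"replace-with-a-securely-stored-secret"
record = {
    "participant_id": "P01",
    "name": "Jane Doe",
    "email": "jane@example.org",
    "systolic_bp": 128,
}
shared = deidentify(record, key)
print(sorted(shared))  # -> ['pseudo_id', 'systolic_bp']
```

Because the same key always yields the same pseudonym, records belonging to one participant stay linkable within the shared data set, while the key itself remains with the data custodian.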