Neuroimaging data are currently disseminated in a variety of formats and via different platforms. Raw magnetic resonance imaging datasets for testing new algorithms or exploring new hypotheses can be found via resources such as OpenfMRI. For carefully curated multi-modal data sets from large-scale studies, the Connectome Coordination Facility is a useful starting point.
The Neurosynth platform is a useful tool for finding peak activation coordinates reported throughout the literature. Another useful online resource is the NeuroVault platform, which hosts unthresholded statistical maps, parcellations and atlases produced in MRI studies.
Data collection and structuring
Raw and derived data
For data to be used effectively and to increase their re-use potential, they must be adequately organized, saved, prepared and documented.
When collecting data it is important to differentiate clearly between raw and derived data. The latter will likely be processed and post-processed, so intermediate and final versions will exist, and these need to be structured and identified correctly. A naming convention that clearly identifies the type of data (correlation maps, brain masks, T1- or T2-weighted images, etc.) is recommended and is in line with widely used community data structuring standards such as BIDS.
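As an illustration of such a naming convention, the following minimal sketch builds file names in the key-value style used by BIDS. The helper name bids_style_name is hypothetical, and only a small subset of the entities defined by the standard is handled; the BIDS specification remains the authoritative reference.

```python
def bids_style_name(subject, suffix, session=None, task=None, extension=".nii.gz"):
    """Build a BIDS-style file name from a few key-value entities.

    Simplified sketch: real BIDS defines more entities and stricter rules.
    """
    parts = [f"sub-{subject}"]
    if session is not None:
        parts.append(f"ses-{session}")
    if task is not None:
        parts.append(f"task-{task}")
    # The suffix identifies the type of data, e.g. "T1w", "T2w" or "bold".
    parts.append(suffix)
    return "_".join(parts) + extension


print(bids_style_name("01", "T1w"))
print(bids_style_name("01", "bold", session="02", task="rest"))
```

The first call yields sub-01_T1w.nii.gz and the second sub-01_ses-02_task-rest_bold.nii.gz, so the type of data can be read directly from the file name.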
All code used for collecting data should be stored in a version control system. The entire analysis workflow should be fully automated in a workflow engine and packaged in a software container or virtual machine to ensure computational reproducibility.
Data sets should be versioned to improve provenance and should undergo automated, periodic quality control to assess their correctness and detect potential errors.
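One simple way to support both versioning and automated checks is to record a checksum for every file and compare the record between releases. The sketch below assumes a data set stored as ordinary files under one directory; the function names are hypothetical, and dedicated data versioning tools offer far more than this.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute a file's SHA-256 checksum, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file path (relative to root) to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_manifests(old, new):
    """List files added, removed or changed between two manifests."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(name for name in shared if old[name] != new[name]),
    }
```

Storing the manifest alongside each data set version makes it possible to run compare_manifests periodically and flag files that changed unexpectedly.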
Ethical and legal aspects
In neuroscience, data from subjects (participants or patients) are routinely acquired in the context of studies or projects to address a specific research question. Usually an agreement between researchers and subjects will be made, which takes the form of a subject-specific consent form. These forms are valuable instruments for both parties to specify exactly how and for what purposes the data will be acquired, and also what will happen to the data throughout a project, including any plans for future sharing. If the consent form is written carefully, it is possible to secure the usage of the data for future analyses and research questions. The Open Science Team of the ULB offers guidance with the writing of consent forms. Further information about consent forms and templates can be found here.
The technical work needed to ensure that a data set is safe for sharing must not be overlooked. Data sets will often need to be anonymized, with the degree of anonymization depending on the research project.
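As a rough illustration of one such technical step, the sketch below replaces subject identifiers in tabular records with random codes and drops directly identifying fields. The field names and the pseudonymize helper are hypothetical, and the sketch covers tabular metadata only, not image content or file headers; what counts as sufficient anonymization must be decided for each project.

```python
import secrets

# Hypothetical set of directly identifying fields; adapt it to the project.
DIRECT_IDENTIFIERS = {"name", "date_of_birth", "address"}


def pseudonymize(records, id_field="subject_id"):
    """Return (cleaned_records, key_table).

    Each original ID is replaced by a random code, and fields listed in
    DIRECT_IDENTIFIERS are dropped. The key table maps original IDs to
    codes and must be stored separately from the shared data.
    """
    key_table = {}
    used_codes = set()
    cleaned = []
    for record in records:
        original_id = record[id_field]
        if original_id not in key_table:
            code = f"sub-{secrets.token_hex(4)}"
            while code in used_codes:  # avoid the unlikely duplicate code
                code = f"sub-{secrets.token_hex(4)}"
            used_codes.add(code)
            key_table[original_id] = code
        kept = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
        kept[id_field] = key_table[original_id]
        cleaned.append(kept)
    return cleaned, key_table
```

Because the codes are random rather than derived from the original identifiers, they cannot be reversed without the key table.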
Data formats and metadata
For raw imaging files, DICOM is the most widely used standard, whereas NIfTI, which is supported by various neuroimaging data analysis packages, is the common file format for derived data. Other available data formats are MINC and NRRD.
If the NIfTI format is used, it is recommended to provide sufficient metadata from the raw data in additional files (e.g. *.TXT or *.JSON). A number of converters are available to transform DICOM files into the NIfTI format.
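A common pattern for such additional files is a JSON "sidecar" that shares the base name of the NIfTI file. The sketch below shows the idea; the write_sidecar helper is hypothetical, the field names follow BIDS sidecar conventions, and the values in the usage note are invented for illustration.

```python
import json
from pathlib import Path


def write_sidecar(nifti_path, metadata):
    """Write metadata to a .json file sharing the NIfTI file's base name."""
    nifti_path = Path(nifti_path)
    name = nifti_path.name
    for extension in (".nii.gz", ".nii"):
        if name.endswith(extension):
            base = name[: -len(extension)]
            break
    else:
        raise ValueError(f"not a NIfTI file name: {name}")
    sidecar = nifti_path.with_name(base + ".json")
    sidecar.write_text(json.dumps(metadata, indent=2, sort_keys=True))
    return sidecar
```

For example, write_sidecar("sub-01_task-rest_bold.nii.gz", {"RepetitionTime": 2.0, "EchoTime": 0.03}) would create sub-01_task-rest_bold.json next to the image, with both times given in seconds.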
Metadata play a crucial role in conveying not only important study information but also technical parameters. In both cases, tabular files (tab- or comma-delimited files such as *.CSV or *.TXT) are widely accepted as metadata file formats.
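Delimited files of this kind can be read with standard tools. A minimal sketch using Python's csv module is shown below; the column names and values are invented for illustration.

```python
import csv
import io


def read_tabular_metadata(text, delimiter="\t"):
    """Parse delimited text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# Invented example: a small tab-delimited participant table.
example = "participant_id\tage\tsex\nsub-01\t34\tF\nsub-02\t27\tM\n"
for row in read_tabular_metadata(example):
    print(row["participant_id"], row["age"], row["sex"])
```

Passing delimiter="," reads comma-delimited files in the same way. Note that all values are returned as strings and need to be converted explicitly where numbers are expected.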
The amount of metadata will depend on the type of analysis (and data processing) conducted. Anatomical data typically require far less metadata than functional studies (fMRI). Tools to automatically extract metadata from raw data and to validate the completeness of the extracted metadata are available from their corresponding NITRC and GitHub project links.
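The idea behind a completeness check can be sketched in a few lines. The required field sets below are simplified assumptions chosen for illustration, not the full requirements of any standard, and the missing_fields helper is hypothetical; dedicated validators should be preferred in practice.

```python
# Illustrative assumption: functional data need more metadata than
# anatomical data. Real standards define the authoritative lists.
REQUIRED_FIELDS = {
    "anat": set(),
    "func": {"RepetitionTime", "TaskName"},
}


def missing_fields(metadata, modality):
    """Return the required fields absent from a metadata dictionary."""
    return sorted(REQUIRED_FIELDS[modality] - set(metadata))


print(missing_fields({"RepetitionTime": 2.0}, "func"))
print(missing_fields({}, "anat"))
```

The first call reports that TaskName is missing, while the second returns an empty list because no fields are required for anatomical data in this sketch.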
The International Neuroinformatics Coordinating Facility (INCF) is responsible for the development of an international infrastructure to promote the sharing not only of data but also of computational resources. The INCF endorses a set of best practices for data sharing in the field of neuroinformatics. These best practices focus on the adoption of open standards, citability for data, and the implementation of tools which are well described, supported, and adopted by the community.
When disseminating your results, all analyses conducted on the data set should be reported. Whenever possible, data sets should be shared using appropriate community-wide standards such as BIDS and annotated using an ontology approach or a similar method which allows for semi-automated, machine-driven analysis. Before releasing a first version of the data, a quality assurance test should be performed.
Licenses for data sharing
Sufficient information must be provided about the conditions under which data sets are made available. While Creative Commons provides a number of licenses that are compatible with and useful for the open data movement, most of these apply to documents, with CC0 being the only data-oriented CC license available. Other data licensing options to consider are the Open Data Commons and the Open Definition initiatives.