During the planning stages of a new project, evaluate which data sets are already available for your study object or site. Even if an initial amount of data has been collected for a given project, it may be necessary to add further data from other authors, regions, species, and so on. The following databases and resources can help you find data:
- Dryad: International repository of data underlying peer-reviewed articles in the basic and applied biosciences
- ITIS: Authoritative taxonomic information on plants, animals, fungi, and microbes
- Protein Data Bank: A worldwide repository of information about the 3D structures of large biological molecules, including proteins and nucleic acids
- GFBio Data Search: Search engine of the German Federation for Biological Data
- KNB: The Knowledge Network for Biocomplexity repository
Data collection and structuring
Data can be collected manually or automatically in the laboratory or field, through literature, or through repository searches. Before collecting data, set up a collection protocol and sampling strategy, and make sure you collect data in a consistent, systematic manner. The German Federation for Biological Data (GFBio) provides example templates for the submission of environmental, occurrence, and molecular sequence data.
Define which data will be created/collected (What? Where? When? Who? How?)
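A consistent collection protocol can be supported by recording every observation in the same fixed structure. The following is a minimal sketch, assuming one record per observation with five text fields named after the questions above. The `Observation` class, the helper functions, and the example values are hypothetical illustrations and do not reproduce any GFBio template.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import List


@dataclass
class Observation:
    """One collected record; field names are illustrative assumptions."""
    what: str   # observed taxon or measured variable
    where: str  # site name or coordinates
    when: str   # date of collection, e.g. in ISO 8601 form
    who: str    # person who collected the data
    how: str    # sampling or measurement method


def missing_fields(obs: Observation) -> List[str]:
    """Return the names of fields that were left empty."""
    return [f.name for f in fields(obs) if not getattr(obs, f.name).strip()]


def to_csv(observations: List[Observation]) -> str:
    """Serialize all records with one fixed column order."""
    buffer = io.StringIO()
    columns = [f.name for f in fields(Observation)]
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for obs in observations:
        writer.writerow(asdict(obs))
    return buffer.getvalue()


# Made-up example values, for illustration only.
complete = Observation("Parus major", "Site A", "2021-05-14", "J. Doe", "point count")
incomplete = Observation("Parus major", "", "2021-05-15", "J. Doe", "")

print(missing_fields(incomplete))  # ['where', 'how']
print(to_csv([complete]))
```

Checking for empty fields at collection time is cheaper than reconstructing missing context later, and a fixed column order keeps files from different collectors comparable.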
Data formats and Metadata
Securing the raw data of a research project is important: it keeps the research transparent and helps ensure that the methods used to develop the data are in line with good scientific standards. Keep raw data unencrypted and uncompressed (or compressed using well-known standards), and use open, well-documented formats whenever possible.
The following data formats are recommended when working with biological data sets:
- CSV, TXT, JSON or JSON-BON (for phylogenetic or sequence data)
- FASTA (for bioinformatics and biochemical data)
- XML, CSV, TXT (for databases, environmental / ecological data)
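Open text formats such as those listed above can be read with very little code, which is part of what makes them suitable for long-term storage. As a minimal sketch, the following reads FASTA text using only the Python standard library. It assumes the common FASTA layout (a header line starting with `>` followed by one or more sequence lines); the function name `read_fasta` and the example sequences are made up for illustration.

```python
import io
from typing import Dict, Optional, TextIO


def read_fasta(handle: TextIO) -> Dict[str, str]:
    """Parse FASTA text into a mapping of header line to sequence."""
    records: Dict[str, str] = {}
    header: Optional[str] = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading ">"
            records[header] = ""
        elif header is not None:
            # Sequences may span several lines; join them.
            records[header] += line
    return records


# Made-up sequences, for illustration only.
example = io.StringIO(">seq1 hypothetical example\nACGTAC\nGGT\n>seq2\nTTGA\n")
sequences = read_fasta(example)
print(sequences)
```

For production work an established parser from a bioinformatics library is usually preferable; the sketch only shows that the format remains readable without specialized software.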
When developing metadata for a research project in the biological sciences, six questions should be answered (Who? How? Where? When? What? Why?). Afterwards, the metadata should be converted to a compatible standard (e.g. EML, ABCD).
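Before converting metadata to a standard, it can help to check that all six questions have an answer. The sketch below assumes the draft metadata is held as a plain dictionary with one key per question. The key names, the `unanswered` helper, and the example values are hypothetical, and the conversion to EML or ABCD itself is not shown.

```python
import json
from typing import Dict, List

# The six questions from the text, used here as assumed dictionary keys.
REQUIRED_QUESTIONS = ("who", "how", "where", "when", "what", "why")


def unanswered(metadata: Dict[str, str]) -> List[str]:
    """Return the questions that have no non-empty answer yet."""
    return [q for q in REQUIRED_QUESTIONS if not metadata.get(q, "").strip()]


# Made-up draft metadata, for illustration only.
draft = {
    "who": "J. Doe",
    "how": "point count",
    "where": "Site A",
    "when": "2021-05-14",
    "what": "bird occurrence records",
}

print(unanswered(draft))  # ['why']

# JSON is used here only as a readable intermediate form of the draft.
print(json.dumps(draft, indent=2))
```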
Sharing research results is now a requirement of many funding bodies. Many journals also require datasets to be openly available and released at the time of publication. Preparing your data to meet these requirements can be a daunting and lengthy process, not only for early-career scientists but also for well-established groups. That is why it is important to use, whenever possible, available tools such as web-based, wizard-like graphical interfaces that allow researchers to easily describe and prepare data for submission. The GFBio and Galaxy projects offer examples of such tools.
When sharing data it is very important to give proper credit to data creators and to make sure your research outputs are attributed, cited, and tracked correctly.