Empa - DigitalScience - Data Capture and Documentation

Data Capture and Documentation

1 Data capture

Data capture should always be done in a consistent way. It is strongly recommended to develop Standard Operating Procedures (SOPs) that clearly define the steps to be taken and outline roles and responsibilities. SOPs are useful even for single-person projects, because they ensure consistency over time. Besides describing the experimental setup and the applied procedure, SOPs should also indicate when to create documentation, where files are stored and how they are named.
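As an illustration of a naming rule that an SOP might fix, the following minimal sketch builds file names from a project code, a sample ID, the capture date and a run number. The pattern (project_sample_YYYYMMDD_runNN.ext) and the function name build_filename are assumptions made for this example, not an Empa convention.

```python
import re
from datetime import date


def build_filename(project: str, sample: str, captured: date, run: int, ext: str = "csv") -> str:
    """Return a name such as 'proj_s01_20240131_run03.csv' (hypothetical pattern)."""
    parts = []
    for label in (project, sample):
        # Keep only letters, digits and hyphens so names stay portable across systems.
        cleaned = re.sub(r"[^A-Za-z0-9-]+", "-", label.strip()).strip("-").lower()
        if not cleaned:
            raise ValueError(f"empty name component: {label!r}")
        parts.append(cleaned)
    # A YYYYMMDD date makes alphabetical order match chronological order.
    parts.append(captured.strftime("%Y%m%d"))
    parts.append(f"run{run:02d}")
    return "_".join(parts) + "." + ext.lstrip(".")


if __name__ == "__main__":
    print(build_filename("Corrosion Study", "S 01", date(2024, 1, 31), 3))
```

Generating names with a single function, rather than typing them by hand, is one way to keep the naming consistent over the lifetime of a project.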

2 Code versioning

In cases where data capture and analysis rely strongly on data processing (implemented in a compiled general-purpose programming language such as C/C++ or FORTRAN, or in an interpreted language such as Python, R or MATLAB), version control software (e.g., svn, git) may provide significant benefits regarding versioning, documentation, bug fixes, and tracking of released/applied versions. The use of version control software is highly recommended for software that is structurally complex (significant number of modules or classes), contains original numerical algorithms, is to be developed and applied over a longer period, or is coded simultaneously by several developers. Such versioning systems are NOT intended for the data themselves, but can be used for any kind of text-file-based coding project (incl. LaTeX manuscripts or web content).

For Empa-internal code development, Empa ICT offers a GitLab server (https://gitlab.empa.ch) that is available to all Empa users, and only to them.
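One practical benefit of version control is that the applied code version can be recorded next to the processed data. The sketch below is an assumption about how this could be done, not a prescribed Empa procedure: it asks git for the current commit hash and formats a one-line note. The function names and the note format are hypothetical; `git rev-parse HEAD` is the standard git command for reading the current commit.

```python
import subprocess
from datetime import datetime, timezone
from typing import Optional


def read_git_commit(repo_dir: str = ".") -> Optional[str]:
    """Return the current commit hash, or None if git or a repository is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, the folder does not exist, or it is not a repository.
        return None
    return result.stdout.strip() or None


def code_version_note(script: str, commit: Optional[str]) -> str:
    """Format a one-line note that can be stored alongside processed data."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    version = commit if commit else "unversioned"
    return f"processed_by={script} code_version={version} processed_at={stamp}"


if __name__ == "__main__":
    print(code_version_note("analyse.py", read_git_commit()))
```

Only the note is stored with the data; the code itself stays in the repository, in line with the recommendation not to use the versioning system for the data themselves.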

3 Dataset Documentation (Metadata)

Describing and documenting data is the only way to ensure that they will be discoverable, searchable, and reusable in the future. This documentation is often called metadata and includes all information needed to find, understand and reuse the data.

Commonly there are two types of documentation (UK Data Archive – Document your data) that ensure usability far into the future. They are:

- Descriptive or study-level documentation
- Structural or data-level metadata

3.1 Metadata – Descriptive or study-level documentation

These metadata describe the content and purpose of the study in a comprehensive way. They include the goal of the study, its design with all experiments and measurements, its major findings and the conclusion. From these metadata, it should be possible to understand the study undertaken and its outcome.

3.2 Metadata – Structural or data-level metadata

These metadata detail the actual measurements with their samples and references, the SOP employed and the complete data analysis. Their purpose is to document each measurement undertaken. Furthermore, they should be detailed enough that any other institution is able to fully reproduce the measurement results. The reproducibility of measurement results and of whole studies is one of the essential aims of research data management (RDM).

3.3 What should metadata contain?

Documentation needs will vary by project and by discipline. Many disciplines have developed metadata standards that specify what information should be collected. If there isn't an existing standard, a template should be created that will record all the important details of the data. At a minimum, it should include the following elements:

General Information:

- Title
- Creator: names and addresses of data creator(s)
- Publisher: addresses of data publisher(s)
- Identifier: can be a permanent identifier or an internal project number
- Funder
- Rights: intellectual property or licensing rights for the data
- Access Information
- Language
- Project description (e.g. subject, scope etc.)
- Data Citation: preferred format for citing the data

Data and file overview:

- Data Structure: including relationships between files
- File description: a short description of each file
- Dates: when each file was created

Methodological information:

- Measurement process: description of measurement setup and measurement method
- Data capturing: description of methods for data capturing
- Data processing: description of methods for data processing (if data is not raw data)

Data-specific information:

- Variable list: with full names and definitions of column headings if tabular data
- Measured quantity (measurand)
- Units of measurement: SI units of the measurement result
- Location of measurement
- Definitions: definitions for codes or symbols used to record missing information

This list is based on MIT Documentation and Metadata guidance, UK Data Archive Study Level Documentation, and Cornell University, Guide to writing "readme" style metadata. This metadata should be included as documentation in a README.txt file in the folder with the data files.
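The minimum "General Information" elements listed above can be checked and written to a README.txt automatically. The following is a minimal sketch under stated assumptions: the "Field: value" layout, the function names and the decision to treat every general-information element as required are choices made for this example, not requirements from the cited guides.

```python
from pathlib import Path

# Minimum elements taken from the "General Information" list above.
REQUIRED_FIELDS = [
    "Title", "Creator", "Publisher", "Identifier", "Funder", "Rights",
    "Access Information", "Language", "Project description", "Data Citation",
]


def missing_fields(metadata: dict) -> list:
    """Return the required fields that are absent or blank."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = metadata.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


def render_readme(metadata: dict) -> str:
    """Render 'Field: value' lines, required fields first, extra fields after."""
    missing = missing_fields(metadata)
    if missing:
        raise ValueError("missing metadata fields: " + ", ".join(missing))
    extra = [key for key in metadata if key not in REQUIRED_FIELDS]
    lines = [f"{key}: {metadata[key]}" for key in REQUIRED_FIELDS + extra]
    return "\n".join(lines) + "\n"


def write_readme(folder: str, metadata: dict) -> Path:
    """Write README.txt into the data folder and return its path."""
    target = Path(folder) / "README.txt"
    target.write_text(render_readme(metadata), encoding="utf-8")
    return target
```

Refusing to write the file while fields are missing makes gaps visible at capture time, when the information is still easy to obtain.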

4 README Files

These metadata are often collected in readme files, which are plain text files (.txt) or sheets in a spreadsheet. A readme file helps others to understand the research data and the interconnections among data files. By titling the file "readme", the data creator signals to users that this file should be read first. For researchers depositing data in a data repository, the information in the readme file augments the information entered in the repository's metadata form. Furthermore, if the deposit includes multiple files, the readme explains the file naming structure, the relationships among the files, and the abbreviations used. Cornell University's Research Data Management Service Group has made a useful readme file template available for download.
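When a deposit contains many files, the per-file part of a readme can be drafted from the folder itself. The sketch below lists every file with a date and a short description. It is an illustration only: the function name, the column layout and the placeholder text are assumptions, and the date shown is the last-modified time, because a true creation date is not available on every file system.

```python
from datetime import datetime, timezone
from pathlib import Path


def file_overview(folder: str, descriptions: dict) -> str:
    """Return one 'path | date | description' line per data file in the folder."""
    lines = []
    for path in sorted(Path(folder).rglob("*")):
        # Skip sub-folders themselves and any existing readme file.
        if not path.is_file() or path.name.lower().startswith("readme"):
            continue
        rel = path.relative_to(folder).as_posix()
        # Last-modified time (UTC), used here as a stand-in for the creation date.
        modified = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
        note = descriptions.get(rel, "NO DESCRIPTION YET")
        lines.append(f"{rel} | {modified:%Y-%m-%d} | {note}")
    return "\n".join(lines)
```

Files without an entry in the descriptions mapping are flagged rather than silently omitted, so the draft shows which descriptions still have to be written by hand.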

5 Metadata Standards

There are a number of community-maintained lists of disciplinary metadata standards:

General collection or definition of metadata
- Research Data Alliance (RDA) - Metadata Directory
- Digital Curation Centre (DCC)
- Dublin Core - domain independent, basic and widely used metadata standard
- Fairsharing.org - The standards in FAIRsharing are manually curated from a variety of sources
- DataCite Metadata Schema (.pdf - V4.1)
Biological science
- Minimum Information for Biological and Biomedical Investigations (MIBBI)
- MINimum information about a high-throughput nucleotide SEQuencing Experiment (MINSEQE) - Genomics standard
Ecological science
- Ecological Metadata Language (EML) - specific for ecology disciplines
Geographic information
- ISO 19115-1:2014 Geographic information - Metadata - Part 1: Fundamentals
- Federal Geographic Data Committee's Content Standard for Digital Geospatial Metadata (FGDC-CSDGM)

Social, behavioral, economic, and health sciences
- Data Documentation Initiative (DDI) - common standard for social, behavioral and economic sciences, including survey data
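Of the standards listed above, Dublin Core is simple enough to illustrate in a few lines. The sketch below serialises a small record using the fifteen elements of the Dublin Core Metadata Element Set and its namespace http://purl.org/dc/elements/1.1/. The wrapper element name "metadata", the function name and the example values are assumptions made for this illustration; a repository will usually prescribe its own container format.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# The fifteen elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def to_dublin_core(record: dict) -> str:
    """Serialise a record to XML; list values produce repeated elements."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # hypothetical wrapper element
    for element, value in record.items():
        if element not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {element!r}")
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
            child.text = str(item)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_dublin_core({
        "title": "Example dataset",          # made-up example values
        "creator": ["A. Researcher", "B. Researcher"],
        "language": "en",
    }))
```

Several of the minimum elements from section 3.3 (title, creator, publisher, identifier, rights, language) correspond directly to Dublin Core elements of the same name.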