Empa - DigitalScience - Data Sharing, Preservation, and Citation

Data Sharing, Preservation and Citation

1 Data Sharing versus Open Data

Many funders and journals have data sharing requirements, and others call for open data. Managing the research data throughout the project helps ensure that either goal can be met. When working with a specific funder or journal, it is important to check their exact requirements.

1.1 Data Sharing

Data Sharing encompasses the spectrum from making data available upon specific request to depositing data in an open and publicly accessible repository. It is important to know specifically what is required by a funder, journal, or institution. For example, one definition of data sharing is making data available to people other than those who have generated them. This can range from bilateral communications with colleagues up to providing free, unrestricted access to the public through a web-based platform.

1.2 Open Data

Open Data is data that is deposited in an open, publicly accessible repository. In particular, the Open Definition summarizes open data as follows: “A piece of data or content is open if anyone is free to use, reuse, and redistribute it, subject only to the requirement to attribute and/or share-alike”. The full definition includes several detailed points that address issues such as access, reuse, redistribution, licensing, technological restrictions and more.

The Panton Principles are a set of recommendations for making research data open in science. In support of their position on open data, they state:

Science is based on building on, reusing and openly criticizing the published body of scientific knowledge.
For science to effectively function, and for society to reap the full benefits from scientific endeavors, it is crucial that science data be made open.
By open data in science we mean that it is freely available on the public internet permitting any user to download, copy, analyze, re-process, pass them to software or use them for any other purpose without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. To this end data related to published science should be explicitly placed in the public domain.

2 Research Data Preservation

Preservation is done at the completion of a project. Ideally, the preservation strategy reflects long-term thinking and differs from how the data were stored during the project. Things to consider when developing a plan for preservation include:

What does one need to keep?
What are the requirements of the funders or of the journals?
How long do the data need to be preserved? (Empa management-handbook, SNSF)
Who is responsible for the data at the end of the project?
Does the funder or journal specify a repository?
Is the documentation sufficient for anyone to use the research data without assistance, including the software needed and the file structures?
Are the file formats open and sustainable?
If the research data are not deposited in a repository, what is the shelf-life of the hardware and when will the data need to be migrated?

The Digital Curation Centre offers further recommendations:

The guide “How to appraise and select research data for curation” (.pdf)
The guide “Five steps to decide what to keep” (.pdf), which helps determine what research data should be retained

3 Future File Usability

Thinking about future file usability helps ensure that the data remain usable and can be shared in the future. Important items to consider include:

Is the file format open or closed?
Is a specific software package required to use the data?
Do the files conform to the previously defined data file structure?
Will somebody be able to open the file 10 years from now?

One should select a consistent file format that can be read well into the future. This means that the file format is open, has documented standards, is unencrypted, and is, if feasible, uncompressed; otherwise it should use lossless compression.

ASCII formatted files will be readable well into the future
.docx -> .txt
.xlsx -> .csv
.jpg -> .tif
Save a copy in the original format, just in case
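
The conversions listed above can be turned into a simple inventory check. The following Python code is a minimal sketch, not an Empa tool: the names `PRESERVATION_TARGETS` and `suggest_preservation_formats` are hypothetical, and the mapping contains only the three example conversions from the list. The function converts nothing; it only reports which files have a suggested target format, so the originals stay untouched.

```python
from pathlib import Path

# Illustrative mapping built from the example conversions above (an
# assumption for this sketch); extend it to match your own file types.
PRESERVATION_TARGETS = {
    ".docx": ".txt",
    ".xlsx": ".csv",
    ".jpg": ".tif",
}


def suggest_preservation_formats(root):
    """Return (path, suggested_suffix) pairs for all files below `root`
    whose extension appears in PRESERVATION_TARGETS."""
    suggestions = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue  # skip directories
        # Compare extensions case-insensitively (.DOCX counts as .docx).
        target = PRESERVATION_TARGETS.get(path.suffix.lower())
        if target is not None:
            suggestions.append((path, target))
    return suggestions
```

Calling `suggest_preservation_formats("my_project")` on a project folder (the folder name is a placeholder) yields a list that can be reviewed before any conversion is done by hand or with a dedicated tool.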

4 Data Repositories

Research data repositories host, provide persistent access to, and preserve datasets. For many disciplines, there are repositories familiar to and well-used by researchers in the field. In addition to considering disciplinary practices around data deposit, researchers should determine whether their funder or publisher requires or recommends a specific data repository for archiving and making data available. There are three main types of data repositories available:

Disciplinary repositories - to check for a repository in a given discipline it is best to search the Registry of Research Data Repositories (re3data.org). Researchers can browse it by discipline, data type and country to find an appropriate home for their data or shared datasets to use in their research.
General repositories: Zenodo
Institutional repositories

5 Research Data Citation

Citing research data in a manner similar to traditional scholarly works can help ensure proper attribution, improve reproducibility, improve discoverability, and help provide credit for research data as a scholarly output. According to the Force11 examples accompanying the Joint Declaration of Data Citation Principles, data should be cited as follows:

Include an in-text citation near the claims relying on the data, in the citation style required by the publisher. Additional information may also be included in the in-text citation, such as the portion of the data set used, e.g. [Author(s), Year, Portion or Subset of Data Used].
Full citations should be included in the reference list, following the format of the required citation style. The DCC guide “How to Cite Datasets and Link to Publications” (.pdf) provides a comprehensive list of data citation elements. If no format exists, Force11's examples recommend: Author(s), Year, Dataset Title, Data Repository or Archive, Version, Global Persistent Identifier.
Permanent identifiers, such as DOIs or ARKs, should be given in the form of a linked URL, if possible.
Data sets should be cited at the most detailed level possible, including the version, if appropriate.
When you cite a dataset, notify the repository, if possible, so that a link to your paper can be added to the dataset record.
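
The recommended element order for a full citation can be illustrated with a short Python sketch. It is one possible layout under stated assumptions, not a prescribed citation style: the function name `format_data_citation`, the punctuation between elements, and the example values (including the DOI `10.1234/example`) are all hypothetical. The identifier is expected to be passed in already as a linked URL.

```python
def format_data_citation(authors, year, title, repository, identifier, version=None):
    """Assemble a full data citation in the element order
    Author(s), Year, Dataset Title, Data Repository or Archive,
    Version, Global Persistent Identifier.

    The punctuation is an illustrative choice; a publisher's required
    citation style takes precedence.
    """
    author_text = "; ".join(authors)
    parts = [f"{author_text} ({year}): {title}", repository]
    if version is not None:
        parts.append(f"Version {version}")  # version is optional
    # The identifier goes last and gets no trailing period, so that the
    # URL stays clean when copied.
    return ". ".join(parts) + ". " + identifier


citation = format_data_citation(
    authors=["Doe, J.", "Roe, R."],         # fictitious authors
    year=2020,
    title="Example Dataset",                # fictitious title
    repository="Example Repository",        # fictitious repository
    identifier="https://doi.org/10.1234/example",  # fictitious DOI
    version="1.0",
)
print(citation)
```

Keeping the version optional reflects the guidance that a version is given only where appropriate.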

6 Permanent Unique Identifier

Persistent identifiers provide a permanent link to a research dataset or other digital object, regardless of hardware or domain changes a repository may make over time. Persistent identifiers are generally provided when data are deposited into a repository. There are several types of persistent identifiers currently used, including Handles (HDL), Archival Resource Keys (ARKs), Persistent URLs (PURLs), and Digital Object Identifiers (DOIs). Most researchers are familiar with DOIs, as this is the system used for most electronic journal articles. The California Digital Library's webpage on Understanding Identifiers provides more information on persistent identifiers, including DOIs.
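
To give an identifier as a linked URL, it has to be combined with a resolver address. The following Python code is a minimal sketch of that step. The function name `resolver_url` is hypothetical, and the choice of resolvers (doi.org for DOIs, hdl.handle.net for Handles, n2t.net for ARKs) is an assumption based on commonly used public services; a repository may specify a different one. The identifiers in the usage note are fictitious.

```python
def resolver_url(identifier):
    """Return a linked URL for a DOI, Handle, ARK, or PURL string.

    Prefixes 'doi:', 'hdl:' and 'ark:' select the scheme; a bare string
    starting with '10.' is treated as a DOI. Anything that is already
    an http(s) URL (for example a PURL) is returned unchanged.
    """
    value = identifier.strip()
    lowered = value.lower()
    if lowered.startswith(("http://", "https://")):
        return value
    if lowered.startswith("doi:"):
        return "https://doi.org/" + value[4:]
    if lowered.startswith("hdl:"):
        return "https://hdl.handle.net/" + value[4:]
    if lowered.startswith("ark:"):
        # ARKs keep their 'ark:' label in the resolved URL.
        return "https://n2t.net/" + value
    if lowered.startswith("10."):
        return "https://doi.org/" + value  # bare DOI
    raise ValueError(f"Unrecognized identifier scheme: {identifier!r}")
```

For example, `resolver_url("doi:10.1234/example")` and `resolver_url("10.1234/example")` both return `https://doi.org/10.1234/example`. Bare Handles are deliberately not guessed, since their prefix/suffix form is ambiguous without the `hdl:` label.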