Formatting and Naming Data

1 Choosing a Data Format

The research equipment, computer hardware, and software often determine the format of the digital data files. However, converting to a non-proprietary file format improves preservation and access, to a degree that depends on the format and the tools used. Software tools exist that perform these data format conversions automatically.

Recommended file formats that best support sharing, reuse, and preservation are open, software-neutral, unencrypted, uncompressed if feasible (otherwise losslessly compressed), and in use within disciplinary communities. Stanford University Libraries Data Management Services has made a useful overview of recommended file formats available. The following list gives the recommended formats (file suffixes) by type of data:

  • Containers: TAR, GZIP, ZIP
  • ASCII-Databases: XML, CSV
  • Geospatial: SHP, DBF, GeoTIFF, HDF5, NetCDF
  • Moving images: MOV, MPEG, AVI, MXF
  • Sounds: WAVE, AIFF, MP3, MXF
  • Statistics: ASCII (i.e. .txt), DTA, POR, SAS, SAV
  • Still images: TIFF, JPEG 2000, PDF, PNG, GIF, BMP
  • Tabular data: CSV
  • Text: XML, PDF/A, HTML, ASCII, UTF-8 (i.e. .txt)
  • Web archive: WARC
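As a minimal sketch of how such a listing could be used in practice, the following Python snippet checks which categories a file's suffix falls under. The dictionary covers only an illustrative subset of the listing, and the suffix spellings (for example ".tif" for TIFF or ".nc" for NetCDF) as well as the function name are assumptions, not part of the Stanford overview.

```python
from pathlib import Path

# Illustrative subset of the listing above. The suffix spellings
# (e.g. ".tif" for TIFF, ".nc" for NetCDF, ".h5" for HDF5) are assumptions.
RECOMMENDED_SUFFIXES = {
    "Containers": {".tar", ".gz", ".zip"},
    "Geospatial": {".shp", ".dbf", ".tif", ".tiff", ".h5", ".nc"},
    "Still images": {".tif", ".tiff", ".png", ".gif", ".bmp"},
    "Tabular data": {".csv"},
    "Text": {".xml", ".html", ".txt"},
    "Web archive": {".warc"},
}


def categories_for(file_name: str) -> list[str]:
    """Return the listing categories whose suffixes match the file name."""
    suffix = Path(file_name).suffix.lower()  # compare case-insensitively
    return sorted(
        category
        for category, suffixes in RECOMMENDED_SUFFIXES.items()
        if suffix in suffixes
    )


if __name__ == "__main__":
    for name in ["table.csv", "scan_01.TIF", "results.xlsx"]:
        print(name, categories_for(name))
```

An empty result only means the suffix is not in this small table; it is a prompt to look at the format more closely, not a verdict on the file.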

During the planning phase[1] the following considerations might be useful:

  • Is the data reliant on proprietary software to access it?
    • If yes, preserving a lossless copy in an open, sustainable file format will help to ensure that everybody can access the data in the future.
    • In addition, it is recommended to keep a copy of the data in its original format, together with a copy of the (instrument) software needed to open the original data file.
  • Will the research data be deposited in a repository at the end of the project, and does that repository have specific guidelines or requirements with respect to file format?
    • If yes, a copy is created in the required format for deposit and the conversion is documented for the users.
  • Will converting to another file format modify the data or cause a loss of features?
    • If yes, then it is strongly recommended to create a copy in an open format while also preserving the data in its original format, together with a copy of the (instrument) software needed to open these data files.

2 File Naming

Planning how files are named makes files easier to find, avoids duplication, and helps to finish projects more quickly. When naming files, the following should be considered:

  • The data files should be named in a consistent manner
  • The whole project team has to know the naming convention
  • Files should receive a meaningful, descriptive name. A file name might include a combination of elements, such as type of equipment used, date, and researcher's surname
  • The best order for elements in a file name should be decided; it will affect how the files are sorted
  • Brief file names should be used
  • Underscores instead of spaces should be used to separate words/dates
  • Letters and numbers, rather than special characters like ~ ! @ # $ % ^ & * ( ) ` ; < > ? , [ ] { } ‘ “ are strongly recommended
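The points above could be turned into a small helper, sketched here in Python. The element order (equipment, date, surname), the year-month-day date format, and the function names are assumptions chosen for illustration; a project team would settle on its own convention.

```python
import re
from datetime import date


def build_file_name(equipment: str, day: date, surname: str, extension: str) -> str:
    """Join the name elements with underscores, keeping only letters and digits."""

    def clean(part: str) -> str:
        # Drop spaces and special characters; keep ASCII letters and digits only
        return re.sub(r"[^A-Za-z0-9]", "", part)

    parts = [clean(equipment), day.strftime("%Y%m%d"), clean(surname)]
    return "_".join(parts) + "." + clean(extension).lower()


def has_only_safe_characters(file_name: str) -> bool:
    """Check that a name uses only letters, digits, underscores, and dots."""
    return re.fullmatch(r"[A-Za-z0-9_.]+", file_name) is not None


if __name__ == "__main__":
    print(build_file_name("SEM #2", date(2019, 1, 30), "O'Neil", "tif"))
    print(has_only_safe_characters("image 1.jpg"))
```

Putting the date second means files sort first by equipment and then chronologically; a different order gives a different sort, which is why the order should be agreed on in advance.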

Repositories in which research data of a certain field are preferably stored may have their own naming rules. It is therefore advisable to check this before the project starts and to take those rules into account when defining the naming convention.

3 Versioning

Versioning should be considered when developing the folder and file naming structure. The following points should be kept in mind:

  • A simple method to designate a revision is to note it at the end of the file name. This way, files can be grouped by their name and sorted by version number:
    • image1_v1.jpg
    • image1_v2.jpg
    • image2_v1.jpg
    • image2_v2.jpg
    • ...
  • If using sub-versioning then the following naming rules apply:
    • Original document: DMP_1.0
    • Original document with minor revisions: DMP_1.1
    • Document with substantial revisions: DMP_2.0
  • If variable digit version numbers are used, one issue that can arise is that computers will sort files based on the position of the characters. This can lead to undesired sorting:
    • Image0001_v1.jpg
    • Image0001_v10.jpg
    • Image0001_v2.jpg
    • ...
  • Therefore, it is recommended to use two-digit version numbers with a leading zero:
    • Image0001_v01.jpg
    • Image0001_v02.jpg
    • Image0001_v10.jpg
    • ...
  • A good practice that can help avoid these problems is to use dates to designate versions. If this strategy is chosen, then the dates should be formatted as year-month-day (YYYYMMDD, e.g. 20190130). Using this order will help avoid confusion when collaborating with other researchers or systems that use day-month-year or month-day-year dates, and it will help sort versions in chronological order:
    • Image0001_20181211
    • Image0001_20181214
    • Image0001_20190123
    • ...
  • If files are created or edited collaboratively, it is recommended to incorporate names or initials into the file naming convention. In this way it is clear which versions contain updates by which member of the team:
    • Dataset0001_20180430_RM
    • Dataset0001_20180501_WIP
    • Dataset0001_20180814_HIC
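The sorting behaviour and the naming patterns described above can be reproduced with a short Python sketch. The helper names and the default padding width are assumptions for illustration; Python's built-in sorted() compares strings character by character, which is the behaviour the list warns about.

```python
from datetime import date


def versioned_name(stem: str, version: int, extension: str, width: int = 2) -> str:
    """Append a zero-padded version number, e.g. Image0001_v02.jpg."""
    return f"{stem}_v{version:0{width}d}.{extension}"


def dated_name(stem: str, day: date, initials: str) -> str:
    """Append a year-month-day stamp and the editor's initials."""
    return f"{stem}_{day:%Y%m%d}_{initials}"


def next_version(version: str, substantial: bool) -> str:
    """Bump a 'major.minor' version: 1.0 -> 1.1 (minor) or 1.1 -> 2.0 (substantial)."""
    major, minor = (int(part) for part in version.split("."))
    return f"{major + 1}.0" if substantial else f"{major}.{minor + 1}"


if __name__ == "__main__":
    # Without padding (width=1), version 10 sorts before version 2
    unpadded = [versioned_name("Image0001", v, "jpg", width=1) for v in (1, 2, 10)]
    print(sorted(unpadded))

    # With two-digit padding, the character order matches the numeric order
    padded = [versioned_name("Image0001", v, "jpg") for v in (1, 2, 10)]
    print(sorted(padded))

    print(dated_name("Dataset0001", date(2018, 4, 30), "RM"))
    print(next_version("1.0", substantial=False))
```

Two digits cover versions up to 99; a project expecting more revisions would need a wider padding or the date-based scheme.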

4 Folder Structure of a Research Project

There are different ways to organize the folders and data files in a project. The folder structure shown below is just one possibility: