Open and FAIR File Formats
Guiding Principles for Choosing an Open and FAIR File Format
Open data file formats are important because they make it more likely that your data files will still be accessible 10 years from now. They also allow other researchers, who may use different operating systems and software packages, to access the data.
Open data formats are preferred over proprietary formats because a proprietary format can become unsupported if the company that created it stops developing the software or goes out of business. This is especially problematic with instruments that only output data in a proprietary file format.
Selecting an open data format can be difficult because there are several factors to weigh. Here are a few things to consider:
The file format should have a specification document that is open for all to see. Some for-profit companies have established file formats that became the de facto industry standard and later published the specification so that other software can also open the files.
The data format should be in widespread use across many applications from different software creators or companies. A format used in proprietary software should also be usable in free/libre and open-source applications.
A file may be openable with a text editor and readable by anyone. These files are usually in a machine-readable format, ranging from the simple comma-separated values (CSV) table format to more complex XML or JSON documents. Not all open formats are text-based: images and large numerical datasets often use binary file formats, which rely on an open specification to be readable.
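As a small illustration of the text-based formats mentioned above, the following Python sketch writes the same records as CSV and as JSON using only the standard library. The records, field names, and function names are hypothetical and chosen purely for illustration.

```python
import csv
import io
import json

# Hypothetical sample records, for illustration only.
records = [
    {"sample_id": "S1", "depth_m": 1.5},
    {"sample_id": "S2", "depth_m": 3.0},
]


def to_csv_text(rows):
    """Serialize a list of dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json_text(rows):
    """Serialize the same records as indented JSON text."""
    return json.dumps(rows, indent=2)


csv_text = to_csv_text(records)
json_text = to_json_text(records)

# Both outputs are plain text and can be inspected in any text editor.
print(csv_text)
print(json_text)
```

Either output could be saved with a .csv or .json extension and opened without specialised software.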
Some file formats do not support all of the functionality available for a particular data type. Here are a few examples:
- Documents – Text (.txt) documents are basic text-only documents. They do not support even basic styling such as bold, italic, or underlined text. Word files (.docx) support a wide range of formatting for text documents: you can make text bold, italic, or underlined, add photos, change the page size, etc. However, .docx files are proprietary to Microsoft; the open file format alternative that supports most of the same functionality is the Open Document Format (.odt).
- Spreadsheets – Comma-separated values (.csv) spreadsheets are basic text-based files that are good for sharing a single set of data. Excel (.xlsx) files can store multiple datasets in a single file, define data types, reference other spreadsheets, store formulas for cell values, and automate functions in the spreadsheet with macros. An open file format alternative with similar functionality to Excel spreadsheets is the Open Document Spreadsheet format (.ods).
- Images – JPEG files are a common image format optimized for photographs, storing large images at a relatively small file size. However, the compression algorithm that makes the files so small is lossy, which leads to data loss and further degradation every time the image is edited and re-saved. The TIF file format supports lossless compression, which does not degrade the image when it is processed, but this also means that TIF files are larger than JPEGs. TIF files also have more features than JPEG: more colour channels, embedded data such as geographic coordinates, storage of decimal numbers in addition to integers, and many more. Both JPEG and TIF are open file formats. In the geosciences, we suggest using TIF for most scientific research.
- Instruments – Almost all measurement instruments, such as mass spectrometers, XRF scanners, and GPS/GNSS units, output their own proprietary file formats that require the manufacturer’s software package (or a specific version!) to access the data. Often, exports of these data do not contain all of the data that is in the original file.
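The spreadsheet example above notes that CSV is a basic text format, whereas richer formats can define data types. A minimal Python sketch of that trade-off follows, using only the standard library. The measurements and the function name are hypothetical; the point is only that numbers written to CSV come back as text unless the reader converts them.

```python
import csv
import io

# Hypothetical measurements with mixed data types, for illustration only.
rows = [{"site": "A", "count": 12, "ph": 7.4}]


def csv_round_trip(rows):
    """Write rows to CSV text, then read them back with csv.DictReader."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    buffer.seek(0)
    return list(csv.DictReader(buffer))


restored = csv_round_trip(rows)

# Every value comes back as a string, because CSV itself stores no data types.
print(restored)
```

For this reason, it can help to document the intended data type and unit of each column alongside a CSV file, for example in a README or data dictionary.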
Open and FAIR Format Lists
There are many lists of open data formats to use as a reference. Not all of them contain the same guidance, and researchers in the open science and open data community often disagree. If you are going to publish data in a repository, the repository may provide a list of formats that you should use.
If not, here are resources with open data formats.
Contact your data steward for help selecting a FAIR and Open Data file format for your data.