Writing a README

What is a README File?

A README file describes the data you have produced and makes that data easier to use, retrieve, and manage. It documents the data on multiple levels so that data handling is transparent and the data remain understandable in the future, both to yourself and to others. Producing high-quality documentation during your research ensures that your data can be:

Understood now and in the future
Interpreted quickly with the context of its creation
Reused by yourself and collaboratively with others

A README is a document file stored alongside the data as part of a data package. It should be referred to in the metadata, which contains the general description of a dataset and enables a broad scientific community to find, share, and understand the content and context of the dataset.


What should you include in a README file?

The following topics should be included in a README:

Title: A very short, descriptive title for your dataset or code that corresponds not only to the research project but also to the dataset/code itself. Keep the title concise: it should contain the few most important keywords and ideally be between 50 and 60 characters long.
Name of the corresponding creator(s) and contact information: This helps others contact you or your team members when needed. Including an ORCID in the contact information makes it easier to find you later.
License: Choose an appropriate license for your dataset/code to let others know to what extent they can reuse it. You can use the Choose a License tool to decide which license suits your dataset/code.
General content description: Brief description (4 or 5 sentences) of the dataset (e.g., date of data collection, project name and funder, link to the supplementary materials, …) and of the content that can be found in the data package.
Folder structure / relations (also mentioned below): Explain the structure of folders and subfolders so that readers can distinguish between the recorded data files. When two or more (sub)folders are related to each other in any way, mention these relations as well.
Folder contents: Describe the content of the included data files and where each file sits within the folders. Also make clear how the data can be used and what its purpose is.
Used abbreviations/acronyms: List the abbreviations and acronyms used in files, column names, and filenames.
Codebook (when not provided in separate files): Describe how the data was obtained or which machine settings were used to obtain it. This should also explain what the categories in the data file columns mean and what processing was applied to produce the files.
Description of workflow: Explain which processing steps were applied to the data, which analyses were performed, and which methods were used to bring the data into the form in which it was added to the data package.
Ethical review (if applicable): When an ethical assessment has been carried out by an ethics review board or committee, mention the details of the assessment and its results.
Software or instrument-specific information: Information needed to understand or interpret the data, including software and hardware version numbers. Include measurement details, standards used, and calibration information, if appropriate.
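
The topics listed above can be turned into a reusable plain-text template. The following is a minimal sketch in Python, not a prescribed tool: the section names follow the list above, while the function names build_readme_skeleton and title_length_ok, the "TODO" placeholder, and the underline layout are assumptions made for illustration. The length check simply applies the 50 to 60 character guideline for titles as a soft hint.

```python
# Section headings taken from the list of README topics above.
SECTIONS = [
    "Creator(s) and contact information",
    "License",
    "General content description",
    "Folder structure / relations",
    "Folder contents",
    "Used abbreviations/acronyms",
    "Codebook",
    "Description of workflow",
    "Ethical review (if applicable)",
    "Software or instrument-specific information",
]


def title_length_ok(title, low=50, high=60):
    """Return True if the title length falls within the suggested range."""
    return low <= len(title.strip()) <= high


def build_readme_skeleton(title):
    """Return a plain-text README skeleton with one empty section per topic."""
    title = title.strip()
    lines = [title, "=" * len(title), ""]
    for name in SECTIONS:
        # "TODO" is a hypothetical placeholder to be replaced by real content.
        lines += [name, "-" * len(name), "TODO", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    example_title = "Soil moisture measurements from field plots, 2020 to 2022"
    if not title_length_ok(example_title):
        print("Note: title is outside the suggested 50 to 60 characters.")
    print(build_readme_skeleton(example_title))
```

Saving the printed text as README.txt next to the data gives a starting point in which each "TODO" is replaced by the actual description.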

If the dataset includes multiple files that relate to one another, describe the relationship between the files or the file structure that holds them. Also mention any related data that was collected but is not included in the described dataset.
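
A description of the file structure can be started from an automatically generated listing. The sketch below assumes the data package is an ordinary folder on disk; the function name describe_tree and the four-space indentation are illustrative choices, and the listing still needs a written explanation of what each folder and file contains.

```python
from pathlib import Path


def describe_tree(root, indent=""):
    """Return the folder structure under root as a list of indented lines.

    Folders are listed before files, each group in alphabetical order.
    """
    lines = []
    entries = sorted(
        Path(root).iterdir(),
        key=lambda p: (p.is_file(), p.name.lower()),
    )
    for entry in entries:
        if entry.is_dir():
            lines.append(f"{indent}{entry.name}/")
            # Recurse into the subfolder with one extra level of indentation.
            lines.extend(describe_tree(entry, indent + "    "))
        else:
            lines.append(f"{indent}{entry.name}")
    return lines


if __name__ == "__main__":
    # "." is a stand-in for the path of your own data package.
    print("\n".join(describe_tree(".")))
```

The output can be pasted into the folder structure section of the README and annotated line by line.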

Describe any quality-assurance procedures performed on the data (for instance, definitions of the codes or symbols used to flag low-quality, questionable, outlying, or missing data that users should be aware of).
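
One way to keep the definitions of such codes in step with the data is to tally how often each code occurs and list the result in the README. The following is a minimal sketch under stated assumptions: the codes "NA", "-999", and "Q" and their meanings are hypothetical examples, the data is assumed to be in CSV form, and the function names are chosen for illustration only.

```python
import csv
import io
from collections import Counter

# Hypothetical quality codes; replace with the codes used in your own data.
QUALITY_CODES = {
    "NA": "value missing",
    "-999": "instrument failure",
    "Q": "questionable value, kept for completeness",
}


def count_quality_codes(csv_text, codes=QUALITY_CODES):
    """Count how often each quality code appears in any column of the CSV text."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        for value in row.values():
            # Skip non-string values (e.g. None for short rows).
            if isinstance(value, str) and value in codes:
                counts[value] += 1
    return counts


def format_code_table(counts, codes=QUALITY_CODES):
    """Return one README line per code with its meaning and frequency."""
    return [
        f"{code}: {meaning} (occurs {counts.get(code, 0)} times)"
        for code, meaning in codes.items()
    ]


if __name__ == "__main__":
    sample = "id,temp\n1,20.5\n2,NA\n3,-999\n4,NA\n"
    print("\n".join(format_code_table(count_quality_codes(sample))))
```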

Documents dealing with data management and privacy aspects of the project can also be considered data documentation and should be included with the data package.