READMEs for Research Data

What is a README file?

A README is a text file that introduces and explains the contents of your project folder, published data, or code. It usually describes the background, context, and collection of research data, and states how the data may be reused by including a license. It is usually written in a plain text format (.txt or .md) so that anyone can open and read it. A README file sits beside the data files, and its name is an instruction: read this file first to make sense of the associated research data files.

Why create a README file?

README files are intended to ensure that the data they describe can be correctly interpreted by you at a later date, or by others once the data are shared or published. From your README file, people interested in using your data should learn what the data files contain, which parts of the research they relate to, how files relate to one another, how the data were generated, how data files have been processed or transformed, and whether there are any restrictions on who can view or access them.

READMEs also come in handy when you revisit your own projects months or years after you last worked on them. Recording information in a README as you collect and process your data means you will be able to recall what you did and why.

Want a template? Download this README.txt and adapt it for your data.

Best (Better) Practices

Create README files for logical "clusters" of related files or data. In many cases it will be appropriate to create one document for a dataset that has multiple, related, similarly formatted files, or files that are logically grouped together for use (e.g. a collection of MATLAB scripts). Sometimes it may make sense to create a README for a single data file, or multiple READMEs for a larger, more complex dataset.

Name the README so that it is easily associated with the data file(s) it describes.

Write your README document as a plain text file, avoiding proprietary formats such as MS Word whenever possible. Format the README document so it is easy to understand (e.g. separate important pieces of information with blank lines, rather than having all the information in one long paragraph).

Format multiple README files identically. Present the information in the same order, using the same terminology.
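
One way to keep several READMEs in the same order with the same terminology is to generate each one from a single shared skeleton. The sketch below is only an illustration: the section headings are the ones listed under "Recommended content" on this page, while the function name `readme_skeleton`, the underline style, and the dataset titles are assumptions made for the example.

```python
# Section headings taken from the "Recommended content" list on this page.
SECTIONS = [
    "General information",
    "Data and file overview",
    "Sharing and access information",
    "Methodological information",
    "Data-specific information",
]


def readme_skeleton(title: str) -> str:
    """Return README text with the same headings in the same order every time."""
    lines = [title, "=" * len(title), ""]
    for heading in SECTIONS:
        lines.append(heading)
        lines.append("-" * len(heading))
        lines.append("")  # blank line left for the author to fill in
    return "\n".join(lines)


# Hypothetical dataset titles, for illustration only.
first = readme_skeleton("Soil moisture survey")
second = readme_skeleton("Stream temperature logs")
print(first)
```

Because both skeletons come from the same list, they differ only in the title, which makes READMEs easier to compare across datasets.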

Use standardized date formats. Suggested format: W3C/ISO 8601 date standard, which specifies the international standard notation of YYYY-MM-DD or YYYY-MM-DDThh:mm:ss.
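
If you record dates with a script, most languages can produce these notations directly. The following is a minimal sketch using Python's standard `datetime` module; the dates are hypothetical example values, and writing a range as two dates separated by a slash follows the ISO 8601 interval notation.

```python
from datetime import date, datetime

# Hypothetical example values; substitute your own collection dates.
collection_day = date(2023, 4, 17)
collection_time = datetime(2023, 4, 17, 9, 30, 0)

# isoformat() yields the YYYY-MM-DD and YYYY-MM-DDThh:mm:ss notations.
day_text = collection_day.isoformat()
time_text = collection_time.isoformat(timespec="seconds")

print(day_text)   # 2023-04-17
print(time_text)  # 2023-04-17T09:30:00

# A date range can be written as two ISO dates separated by a slash.
end_day = date(2023, 5, 2)
range_text = f"{collection_day.isoformat()}/{end_day.isoformat()}"
print(range_text)  # 2023-04-17/2023-05-02

# Parsing the text back confirms that it is a valid ISO date.
assert date.fromisoformat(day_text) == collection_day
```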

Recommended content

Recommended minimum content for data re-use is in bold.

General information
- Provide a title for the dataset
- Name, institution, address, email, and ORCID iD for:
  - Principal investigator (or person responsible for collecting the data)
  - Associate or co-investigators
  - Contact person for questions
- Date of data collection (can be a single date, or a range)
- Information about geographic location of data collection
- Keywords used to describe the data topic
- Language information
- Information about funding sources that supported the collection of the data

Data and file overview
- For each filename, a short description of what data it contains
  - NOTE: When working with a large number of files, a short description about what each collection of similar files contains may be best
- Format of the file if not obvious from the file name
- If the dataset includes multiple files that relate to one another, the relationship between the files or a description of the file structure that holds them (possible terminology might include "dataset" or "study" or "data package")
- Date that the file was created
- Date(s) that the file(s) were updated (versioned) and the nature of the update(s), if applicable
- Information about related data that were collected but are not in the described dataset
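
For folders with many files, a first draft of the file overview can be generated and then edited by hand. The sketch below is one possible approach, not a prescribed tool: the function name `file_overview`, the tab-separated layout, and the demonstration files are assumptions. It reports the last-modified date, because a file's creation date is not reliably available on every operating system, so record creation dates yourself where they matter.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def file_overview(folder: str) -> list[str]:
    """List each file with its size and last-modified date (ISO 8601, UTC)."""
    rows = []
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file():
            continue  # skip directories
        info = path.stat()
        modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
        rows.append(
            f"{path.relative_to(folder).as_posix()}\t{info.st_size} bytes\t"
            f"last modified {modified.date().isoformat()}\t[description to add]"
        )
    return rows


# Demonstration on a temporary folder holding two hypothetical files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "raw").mkdir()
    (Path(tmp) / "raw" / "site_a.csv").write_text("id,temp_c\n1,12.5\n")
    (Path(tmp) / "notes.txt").write_text("field notes\n")
    listing = file_overview(tmp)

for row in listing:
    print(row)
```

The "[description to add]" placeholder marks where the short, human-written description of each file belongs; the script cannot supply that part.
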

Sharing and access information
- Licenses or restrictions placed on the data
- Links to publications that cite or use the data
- Links to other publicly accessible locations of the data (see best practices for sharing data for more information about identifying repositories)
- Recommended citation for the data (see best practices for data citation)

Methodological information
- Description of methods for data collection or generation (include links or references to publications or other documentation containing experimental design or protocols used)
- Description of methods used for data processing (describe how the data were generated from the raw or collected data)
- Any software or instrument-specific information needed to understand or interpret the data, including software and hardware version numbers
- Standards and calibration information, if appropriate
- Describe any quality-assurance procedures performed on the data
- Definitions of codes or symbols used to flag low-quality or questionable values and outliers that people should be aware of
- People involved with sample collection, processing, analysis and/or submission

Data-specific information
*Repeat this section as needed for each dataset (or file, as appropriate)*
- Number of variables, and number of cases or rows
- Variable list, including full names and definitions (spell out abbreviated words) of column headings for tabular data
- Units of measurement
- Definitions for codes or symbols used to record missing data
- Specialized formats or other abbreviations used
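
For tabular data, the counts and the variable list above can be drafted with a short script and then completed by hand. This is a minimal sketch under stated assumptions: the sample table, its column names, the `NA` missing-data code, and the function name `summarize_table` are all hypothetical and stand in for your own file and conventions.

```python
import csv
import io

# Hypothetical sample data standing in for a real CSV file.
SAMPLE = """site_id,temp_c,depth_m
A1,12.5,0.5
A2,NA,1.0
A3,14.1,NA
"""
MISSING_CODE = "NA"  # assumed missing-data code; use whatever your data use


def summarize_table(handle, missing_code=MISSING_CODE):
    """Count rows and variables, and tally missing values per column."""
    reader = csv.DictReader(handle)
    missing = {name: 0 for name in reader.fieldnames}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name in reader.fieldnames:
            if row[name] == missing_code:
                missing[name] += 1
    return n_rows, list(reader.fieldnames), missing


n_rows, variables, missing = summarize_table(io.StringIO(SAMPLE))
print(f"Number of variables: {len(variables)}")
print(f"Number of rows: {n_rows}")
for name in variables:
    print(f"- {name}: [definition and units to add]; "
          f"missing ({MISSING_CODE}): {missing[name]}")
```

To run it on a real file, pass an open file handle instead of the sample text, for example `open("data.csv", newline="")`, where `data.csv` is a placeholder name. Full variable names, definitions, and units still have to be written by someone who knows the data.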

Source material for this page: Cornell University's Research Data Management Service Group.