Documentation and metadata
This article contains the following sections:
- What are data documentation and metadata?
- Why document data?
- When to document
- What to document
- Data documentation methods
What are data documentation and metadata?
Data documentation provides the contextual and explanatory information needed to discover, access, understand, and reuse research data. Without documentation it may be impossible for others, including your future self, to correctly interpret and reuse your data.
Metadata is a highly structured form of documentation that is designed to be machine-readable.
Why document data?
Data documentation makes data clear to understand and easy to use, which in turn:
- enables replication and reproducibility of research findings;
- supports research transparency;
- facilitates data sharing and reuse;
- reduces the risk of data being misinterpreted or misused;
- informs the long-term preservation of data.
When to document
It’s good practice to create data documentation from the beginning of your project, and to continue adding information as your project progresses. This is easier than trying to reconstruct later what you did.
What to document
Documentation should accompany the data and can provide information at the project, data and variable levels.
Project-level documentation
Project-level documentation provides high-level information about the research project. This could include information about:
- Research context e.g. the research questions and objectives, funders, publications from the data.
- Methods used to collect, process and analyse data e.g. workflows, sampling design, provenance of existing data sources used, protocols for cleaning, editing and coding data, any instruments, equipment, software or computer code used.
- Quality assurance procedures e.g. calibration procedures, repeat measures, using standardised methods or protocols, checking for errors.
- Who collected the data, where and when.
- Administrative information e.g. where data is available, the conditions of access and use, intellectual property rights, information on data confidentiality.
Data-level documentation
Data-level documentation provides information about the folders and files. This could include explanation of the folder structure, file naming conventions, relationships between files, the file types, etc. README files are commonly used for recording data-level documentation.
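As a rough illustration of data-level documentation, the Python sketch below drafts a plain-text README that lists every file in a dataset folder, leaving a placeholder description for you to fill in. The function name build_readme, the layout of the output and the "TODO" placeholder are assumptions made for this example, not a required README format.

```python
from pathlib import Path


def build_readme(root: Path, title: str) -> str:
    """Return draft README text listing every file found under root.

    Hypothetical helper for illustration only: the layout below is one
    possible convention, not a standard.
    """
    lines = [title, "=" * len(title), "", "Folder structure and files:"]
    # Walk the folder tree in a stable (sorted) order so that the README
    # does not change between runs unless the files change.
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root)
            file_type = path.suffix or "no extension"
            lines.append(f"- {rel.as_posix()} ({file_type}): TODO describe contents")
    return "\n".join(lines) + "\n"
```

You could save the returned text as README.txt in the top-level folder of the dataset and then replace each placeholder with a description of the file and its relationship to other files.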
Variable-level documentation
Variable-level documentation provides information about individual data records or data points. This could include definitions and explanations of variables, values, codes, classifications, abbreviations, missing values, and units of measurement. Data dictionaries, codebooks and README files are commonly used for recording variable-level documentation.
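To show what variable-level documentation might look like in practice, the sketch below drafts one data dictionary entry per column of a CSV file. It is a minimal example under stated assumptions: the function name draft_data_dictionary, the entry fields and the missing-value codes ("" and "NA") are all hypothetical, and the descriptions and units still have to be written by someone who knows the data.

```python
import csv
import io


def draft_data_dictionary(csv_text, missing_codes=("", "NA")):
    """Return a list of draft data dictionary entries, one per CSV column.

    Hypothetical helper: the entry fields are illustrative, not a standard.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    fieldnames = reader.fieldnames or []

    entries = []
    for name in fieldnames:
        values = [row[name] for row in rows]
        # Treat absent cells (None) and the listed codes as missing.
        present = [v for v in values if v is not None and v not in missing_codes]
        entries.append({
            "variable": name,
            "description": "",  # to be completed by the researcher
            "units": "",        # to be completed by the researcher
            "n_missing": len(values) - len(present),
            "example_values": sorted(set(present))[:3],
        })
    return entries
```

For example, calling the function on a small table with the columns participant_id, height_cm and smoker returns three entries, each reporting how many values were missing for that variable.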
Metadata
Metadata is often described as ‘information about data’ and is intended for reading by machines. It provides a structured and standardised way to describe, explain, locate, or otherwise make it easier to retrieve, use, or manage data.
Metadata may be created manually by people or automatically by instruments or computers. Metadata can be captured in files that are stored alongside the data or within the data files.
For example, researchers typically provide metadata when depositing data in a data repository by completing a form, which often includes fields such as title, description, creator, keywords, etc. This enables the metadata to be easily searched and harvested, which is useful for researchers dealing with large amounts of data. Alternatively, a camera may embed metadata within an image file, such as the date taken, camera settings, etc.
Some research communities use domain-specific metadata standards. There are also established tools for ensuring consistency in naming and meaning, such as controlled vocabularies, taxonomies, thesauri and ontologies. You can find community standards using the following websites:
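The idea of structured, machine-readable metadata can be sketched in a few lines of Python. The example below builds a record from the four fields mentioned above (title, description, creator, keywords) and serialises it as JSON. The function name make_metadata_record and the rule that every field must be filled in are assumptions for illustration; this is not the schema of any particular repository or metadata standard.

```python
import json

# Field names taken from the examples in the text above; treating all four
# as required is an assumption made for this sketch.
REQUIRED_FIELDS = ("title", "description", "creator", "keywords")


def make_metadata_record(title, description, creator, keywords):
    """Return a JSON string describing a dataset, or raise if a field is empty."""
    record = {
        "title": title,
        "description": description,
        "creator": creator,
        "keywords": list(keywords),
    }
    missing = [field for field in REQUIRED_FIELDS if not record[field]]
    if missing:
        raise ValueError(f"Missing metadata fields: {', '.join(missing)}")
    return json.dumps(record, indent=2, ensure_ascii=False)
```

Because the result is plain JSON, it could be stored in a file alongside the data and read back by software with json.loads, which is what makes this kind of documentation searchable and harvestable.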
- FAIRsharing.org provides information about data and metadata standards, databases and data policies
- Metadata Standards Catalog is a directory of metadata standards for research data
Data documentation methods
Methods for documenting your data could include:
- README files: A text file included with a dataset that explains its contents and structure. A README.txt file is usually located in the top-level folder of a dataset and gives enough information to understand the project and navigate the subfolders. If necessary, supplementary READMEs can be provided within project subfolders to add clarity. Cornell University provides a guide to writing README files, including a template you can download and adapt as needed.
- Research or laboratory notebooks: Notebooks (whether paper or digital) provide a systematic and detailed record of the research by documenting hypotheses, procedures, experiments, observations, data, analysis and conclusions.
- Data dictionaries: Structured documents that provide context for tabular datasets. Data dictionaries can describe the variables, structure, content and layout of your data, and may include other information that would help others understand your project (e.g. methodological details, data collection instruments). They often record definitions of variables (names and descriptions), their units, coding values and meanings, relationships between variables, issues with the data (such as missing values and errors), etc. An example template is available: Phegley, L. (2023). Data Dictionary Blank Template. University of Pennsylvania.
- Codebooks: A codebook is analogous to a data dictionary but is typically used for qualitative data rather than quantitative data. Codebooks often provide context for survey, interview or other qualitative data, and focus on the interpretation of codes e.g. code definitions, when they apply, and relationships between codes.
- Computer code comments: Explanatory annotations added to computer code that make it easier to understand the code’s purpose, functions, expected format of input files, etc.
- Other supporting documentation: Documents generated during a research project that can enhance the understanding and reuse of data e.g. templates of surveys, participant information sheets and consent forms.
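As an illustration of the computer code comments method listed above, the short Python function below uses a docstring and an inline comment to record its purpose, the expected format of its input and how missing values are handled. The function, the 'height_cm' column name and the missing-value codes are hypothetical and chosen only for this example.

```python
def mean_height_cm(rows, missing_codes=("", "NA")):
    """Return the mean height in centimetres, skipping missing values.

    Expected input: an iterable of dicts, one per participant, each with a
    'height_cm' key holding the height as text (for example "172").
    Values listed in missing_codes are treated as missing and ignored.
    Returns None when no usable heights are present.
    """
    # Convert only the non-missing entries; a non-numeric value raises
    # ValueError so that data errors are not silently hidden.
    heights = [
        float(row["height_cm"])
        for row in rows
        if row["height_cm"] not in missing_codes
    ]
    if not heights:
        return None
    return sum(heights) / len(heights)
```

A reader who has never seen the project can tell from the comments alone what the function expects and what it does with incomplete records, without having to run or reverse-engineer the code.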