Preparing and Enriching Data
Author: Michał Mrugalski
Pre-processing Data
Data preparation and enrichment cannot be easily separated from the following step in the life cycle of data, data analysis (discussed in Exploring and Analysing Data). Since data processing is typically the first step of data analysis, the way in which we approach a corpus with a research question and hypothesis will determine how the data is prepared. In CLS, these preparation processes typically involve
Texts can be annotated both on the
(based on: https://methods.clsinfra.io/annotation-intro.html). Standard operations on textual data require all or at least most of these steps as prerequisites.
Take-home message: The preparation of data is the first step of data analysis. Data and metadata are best prepared simultaneously: annotation processes on the document level notwithstanding, it is important to create metadata that describes data as soon as possible since metadata should contain information on pre-processing. The preparation of (meta)data is done with an explicit vision of the research outcome and research method in mind, as stated in the data collection protocol, and should likewise be documented in a log (see section 3.2).
Improving Quality of Data
The preparation of data in a broader sense, by no means limited to CLS or NLP, pertains to the quality of data that is intended to undergo an analysis. In this case, preparation amounts to data enrichment understood as “the process of enhancing existing information by
This usually occurs by adding new attributes, for example by supplementing a wikitext with literary historical metadata or comparing it with a standard edition. Problems with data quality have various root causes, some related to data in general and some particular to CLS.
Duplicate data is a broad term for problems that arise when the same data, possibly in slightly different form, is entered more than once while extracting data from several sources (e.g., libraries or general collections). In CLS in particular, duplicate data may indicate a lack of knowledge of literary history: for instance, the same poem or piece of narrative prose appears twice in a corpus due to variations in author attribution, title, or even the spelling of a name. Older literature is typically far less standardized in paratextual terms. This can be addressed by automatically checking the corpus for duplicate text sequences. Another way to prevent such errors is to examine carefully, as a literary scholar, the resources that are (re)used to create new data for computation. Additionally, since securing data quality falls under the responsibility of GLAM institutions and repository curators, researchers should inform all interested parties about the defects they find. The enrichment of data turns out to be one of the most vital fields of cooperation between researchers and institutions sharing their expertise and knowledge.
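An automatic check for duplicate text sequences can be sketched as follows. This is a minimal illustration, not a prescribed method: it assumes the corpus fits in memory as a dictionary of document IDs and texts, and the function names, the 5-word shingle size, and the 0.8 threshold are all hypothetical choices that would need tuning for a real corpus.

```python
import re
from itertools import combinations


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def shingles(text: str, n: int = 5) -> set:
    """Return the set of word n-grams (shingles) of the normalized text."""
    words = normalize(text).split()
    if len(words) < n:
        # Short texts are represented by a single shingle; empty texts by none.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_near_duplicates(corpus: dict, threshold: float = 0.8, n: int = 5) -> list:
    """Return (id_a, id_b, similarity) for every pair at or above the threshold.

    Compares all pairs, so it is only practical for small corpora.
    """
    sets = {doc_id: shingles(text, n) for doc_id, text in corpus.items()}
    pairs = []
    for a, b in combinations(sorted(sets), 2):
        score = jaccard(sets[a], sets[b])
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs
```

Pairs flagged in this way are only candidates: whether two texts really are the same work under a different attribution or title remains a judgment for the literary scholar.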
Having the same kind of data stored in different formats is likewise a common quality problem, often the result of drawing on different resources. For example, dates may be stored in multiple formats, such as European Date (DD MM YYYY), US Date (MM DD YYYY), and Japan Date (YYYY MM DD).
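One way to handle the date example is to convert every date to a single representation such as ISO 8601 (YYYY-MM-DD). The sketch below assumes that the convention used by each source is known in advance; the `SOURCE_FORMATS` mapping and the `to_iso` function are hypothetical names introduced only for illustration.

```python
from datetime import datetime

# Hypothetical mapping from a source's date convention to its layout.
SOURCE_FORMATS = {
    "european": "%d %m %Y",  # DD MM YYYY
    "us": "%m %d %Y",        # MM DD YYYY
    "japan": "%Y %m %d",     # YYYY MM DD
}


def to_iso(date_string: str, source: str) -> str:
    """Parse a date using its source's known layout and return YYYY-MM-DD."""
    fmt = SOURCE_FORMATS[source]
    return datetime.strptime(date_string.strip(), fmt).date().isoformat()
```

The format is passed in explicitly rather than guessed because a string such as "03 04 1850" is valid in both the European and the US layout but denotes two different days, so the convention has to be recorded per source.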
It is also important to pay particular attention to the metadata format. Metadata inconsistency occurs when two comparable encodings follow different standards, e.g., researchers merging two separate CSV files or different libraries producing different headers in their TEI files.
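For the CSV case, one possible approach is to map every source header onto a shared schema before merging, and to stop when an unknown header appears instead of silently dropping it. The header names and the `HEADER_MAP` table below are invented for illustration and do not reflect any particular library's export format.

```python
import csv
import io

# Hypothetical mapping from each source's header names to one shared schema.
HEADER_MAP = {
    "Author": "author", "creator": "author",
    "Title": "title", "work_title": "title",
    "Year": "year", "date_published": "year",
}


def read_harmonized(csv_text: str) -> list:
    """Read CSV text and rename its columns to the shared schema."""
    reader = csv.DictReader(io.StringIO(csv_text))
    unknown = [h for h in (reader.fieldnames or []) if h not in HEADER_MAP]
    if unknown:
        # Fail loudly so that the mismatch can be noted and resolved by hand.
        raise ValueError(f"Unmapped headers: {unknown}")
    return [{HEADER_MAP[key]: value for key, value in row.items()} for row in reader]
```

Two files with different headers can then be merged by concatenating the harmonized rows, for example `read_harmonized(file_a) + read_harmonized(file_b)`.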
The best approach is to keep a data quality issues record, ideally as an attachment to your data collection / creation log, so that you can monitor the problems and avoid errors. A plausible breakdown of a data quality issues log can be found, for example, here. An entry contains a description of the issue, the people responsible for resolving it, and the approach adopted. A log enables researchers to confirm the efficacy of data cleansing procedures and to develop preventative measures (see: https://www.yellowfinbi.com/best-practice-guide/data-preparation-enrichment-performance/data-quality).
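Such a log can be kept as a simple CSV file. The sketch below uses the three entry components named above (description, responsible people, approach); the `date_logged` and `status` fields, like the class and function names, are assumptions added for illustration and are not part of any prescribed format.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class QualityIssue:
    description: str       # what is wrong with the data
    responsible: str       # who is in charge of resolving it
    approach: str          # how the issue was or will be handled
    date_logged: str = ""  # assumption: not named in the text above
    status: str = "open"   # assumption: not named in the text above


def write_log(issues, handle) -> None:
    """Write the issues as CSV rows, one column per field, with a header row."""
    writer = csv.DictWriter(handle, fieldnames=[f.name for f in fields(QualityIssue)])
    writer.writeheader()
    for issue in issues:
        writer.writerow(asdict(issue))


# Usage: write to an in-memory buffer; a real log would be a file on disk.
buffer = io.StringIO()
write_log(
    [QualityIssue("Same poem listed under two author spellings", "corpus editor",
                  "merged the records and kept both spellings in the metadata")],
    buffer,
)
```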
Rarely is data enrichment an isolated activity. Every data enrichment task needs to be
Take-home message: Data quality issues as well as repeatable, rule-driven, and evaluable enrichment procedures should be recorded in a special document, an extension of your data collection / creation log.
FAIR-ification
Finally, data enrichment and preparation may simply mean the FAIR-ification of data, i.e., making your data more FAIR. Some tools support the principles of open science, FAIR, and CARE, for example OpenRefine (https://openrefine.org/), an open-source program for working with messy data that can be used to clean up data and convert it between formats, i.e., to deal with the issues listed in the section Improving Quality of Data.
Openly available automated FAIR metrics evaluation tools that help researchers assess whether the data they work with and publish comply with the FAIR Principles include FAIRAware, SATIFYD, the FAIR evaluator, and the FAIR-Self-Check by NFDI4Culture.
Take-home message: There are free and open-access tools that facilitate data enrichment and refinement before one embarks on sharing and linking data.