Preparing and Enriching Data

Author: Michał Mrugalski

Pre-processing Data

Data preparation and enrichment cannot be easily separated from the following step in the life cycle of data, data analysis (discussed in Exploring and Analysing Data). Since data processing is typically the first step of data analysis, the way in which we approach a corpus with a research question and hypothesis will determine how the data is prepared. In CLS, these preparation processes typically involve

digitization (via OCR or manual transcription) of printed texts and/or reuse of existing digital material (see Reusing Data)
layout analysis, including e.g. removing noise or unwanted text (editor’s footnotes, running heads, etc.)
segmenting text on a sub-sentence level (tokenization, stemming, lemmatization)
removing so-called stop-words (mostly when we are interested in semantic analysis such as Topic Modeling, Sentiment Analysis, or Named Entity Recognition, see Exploring and Analysing Data)
annotating text

Texts can be annotated both on the

document level (metadata) and
token level (tagging)

(based on: https://methods.clsinfra.io/annotation-intro.html). Standard operations on textual data require all or at least most of these steps as prerequisites.
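The sub-sentence segmentation and stop-word removal steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not a recommended pipeline: the stop-word list and the function names are hypothetical, and real projects would use a language-specific tokenizer and lemmatizer. The sketch also records which steps were applied, so that this information can later flow into the metadata.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this
# example); real projects use a list suited to language and research question.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop every token that appears in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def preprocess(text: str) -> dict:
    """Run both steps and document which operations were applied."""
    tokens = tokenize(text)
    return {
        "tokens": tokens,
        "content_tokens": remove_stop_words(tokens),
        "preprocessing": ["lowercasing", "regex tokenization", "stop-word removal"],
    }


result = preprocess("The art of the novel")
print(result["content_tokens"])
```

Keeping the list of applied operations next to the output is one simple way to make pre-processing decisions visible in the metadata.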

Take-home message: The preparation of data is the first step of data analysis. Data and metadata are best prepared simultaneously: annotation processes on the document level notwithstanding, it is important to create metadata that describes data as soon as possible since metadata should contain information on pre-processing. The preparation of (meta)data is done with an explicit vision of the research outcome and research method in mind, as stated in the data collection protocol, and should likewise be documented in a log (see section 3.2).

Improving Quality of Data

The preparation of data in a broader sense, by no means limited to CLS or NLP, pertains to the quality of data that is intended to undergo an analysis. In this case, preparation amounts to data enrichment understood as “the process of enhancing existing information by

supplementing missing or incomplete data” (Allen and Cervo 2015, 152) and / or
enhancing “collected data with relevant context obtained from additional sources” (Knapp and Langill 2015, 343).

This usually occurs by adding new attributes, for example by supplementing a wikitext with literary historical metadata or comparing it with a standard edition. Problems with data quality have various root causes, some related to data in general and some particular to CLS.

Duplicate data is a broad term for problems that arise when the same data, possibly in slightly different form, is entered more than once while data is extracted from several sources (e.g., libraries or general collections). In CLS in particular, duplicate data may indicate a lack of knowledge of literary history: for instance, the same poem or piece of narrative prose appears twice in a corpus due to variations in author attribution, title, or even the spelling of a name. Older literature is typically far less standardized in paratextual terms. This could be addressed by automatically checking for duplicate text sequences in the corpus. Another way to prevent these kinds of errors is to examine carefully, as a literary scholar, the resources that are reused to create new data for computation. Additionally, since securing data quality falls under the responsibility of GLAM institutions and repositories’ curators, researchers should inform all interested parties about the defects they find. The enrichment of data turns out to be one of the most vital fields of cooperation between researchers and institutions sharing their expertise and knowledge.
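One possible way to check automatically for duplicate text sequences is to compare documents by their overlapping word n-grams ("shingles") and flag pairs whose Jaccard similarity exceeds a threshold. The sketch below is an assumption about how such a check might look, not a method prescribed here; the threshold of 0.8, the shingle length, and the function names are illustrative choices, and flagged pairs still need review by a literary scholar.

```python
import re
from itertools import combinations


def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping word n-grams of a text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Share of n-grams that two texts have in common."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_duplicates(corpus: dict[str, str], threshold: float = 0.8, n: int = 3):
    """Return (id_a, id_b, score) for every pair at or above the threshold."""
    sets = {doc_id: shingles(text, n) for doc_id, text in corpus.items()}
    pairs = []
    for (id_a, set_a), (id_b, set_b) in combinations(sets.items(), 2):
        score = jaccard(set_a, set_b)
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    return pairs


# Toy corpus: "a" and "b" differ only in case and punctuation.
corpus = {
    "a": "Shall I compare thee to a summer day",
    "b": "SHALL I compare thee, to a summer day!",
    "c": "An entirely different line of verse here",
}
print(find_duplicates(corpus))
```

Pairwise comparison grows quadratically with corpus size, so larger corpora would need an indexing strategy on top of this idea.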

Storing the same kind of data in different formats is likewise a common quality problem, and it tends to arise when different resources are combined. An illustration is the storage of dates in multiple formats, such as the European date (DD MM YYYY), the US date (MM DD YYYY), and the Japanese date (YYYY MM DD).
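A small sketch of how such dates could be normalized to a single format (here ISO 8601, YYYY-MM-DD) with Python's standard library. Note that a string like "03 04 1899" is ambiguous between the European and the US reading, so the sketch assumes that the convention of each source is known and passed in explicitly; the convention labels and the function name are hypothetical.

```python
from datetime import datetime

# Convention labels are assumptions for this example; the format strings
# correspond to the three layouts named in the text.
FORMATS = {
    "european": "%d %m %Y",  # DD MM YYYY
    "us": "%m %d %Y",        # MM DD YYYY
    "japan": "%Y %m %d",     # YYYY MM DD
}


def to_iso(raw: str, convention: str) -> str:
    """Parse a date written in a known convention and return YYYY-MM-DD."""
    parsed = datetime.strptime(raw.strip(), FORMATS[convention])
    return parsed.date().isoformat()


print(to_iso("03 04 1899", "european"))
print(to_iso("03 04 1899", "us"))
print(to_iso("1899 04 03", "japan"))
```

Invalid input raises a ValueError, which is a convenient point at which to record a data quality issue rather than to guess.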

It is also important to pay particular attention to the metadata format. Metadata inconsistency occurs when two comparable encodings follow different standards, e.g., researchers merging two separate CSV files or different libraries producing different headers in their TEI files.
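For the CSV case, one hedged way to handle inconsistent headers is an explicit mapping from every header variant to a canonical field name, applied before the files are merged. The header names and the mapping below are invented for illustration; the sketch assumes well-formed rows and deliberately fails on unknown headers instead of silently dropping columns.

```python
import csv
import io

# Hypothetical mapping from header variants to canonical field names.
HEADER_MAP = {
    "author": "author", "creator": "author",
    "title": "title", "work_title": "title",
    "year": "year", "pub_year": "year",
}


def read_harmonized(csv_text: str) -> list[dict[str, str]]:
    """Read CSV text and rename its columns to the canonical names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        unknown = [key for key in row if key.strip().lower() not in HEADER_MAP]
        if unknown:
            raise ValueError(f"Unmapped headers: {unknown}")
        rows.append({
            HEADER_MAP[key.strip().lower()]: (value or "").strip()
            for key, value in row.items()
        })
    return rows


def merge(*csv_texts: str) -> list[dict[str, str]]:
    """Concatenate the harmonized rows of several CSV sources."""
    merged = []
    for text in csv_texts:
        merged.extend(read_harmonized(text))
    return merged


source_a = "author,title,year\nMickiewicz,Pan Tadeusz,1834\n"
source_b = "Creator,Work_Title,Pub_Year\nGoethe,Faust,1808\n"
for record in merge(source_a, source_b):
    print(record)
```

The mapping itself is worth keeping under version control, since it documents how the two standards were reconciled.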

The best approach is to keep a record of data quality issues, ideally as an attachment to your data collection / creation log, so that you can monitor problems and avoid repeating errors. Each entry contains a description of the issue, the people responsible for resolving it, and the approach adopted. Such a log enables researchers to confirm the efficacy of data cleansing procedures and to develop preventative measures (for a plausible breakdown of a data quality issues log, see https://www.yellowfinbi.com/best-practice-guide/data-preparation-enrichment-performance/data-quality).
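As a rough sketch, such a log could be kept as a simple CSV file with one row per issue. The three core fields follow the description above (issue, responsible people, approach); the additional fields `reported` and `resolved`, like all names in the example, are assumptions added for illustration.

```python
import csv
import io
from dataclasses import asdict, dataclass, field
from datetime import date

FIELDS = ["description", "responsible", "approach", "reported", "resolved"]


@dataclass
class QualityIssue:
    description: str   # what is wrong with the data
    responsible: str   # who is in charge of resolving it
    approach: str      # how the issue is being addressed
    # The two fields below are assumptions, not part of the source text.
    reported: str = field(default_factory=lambda: date.today().isoformat())
    resolved: bool = False


def log_to_csv(issues: list[QualityIssue]) -> str:
    """Serialize the issue log as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for issue in issues:
        writer.writerow(asdict(issue))
    return buffer.getvalue()


issue = QualityIssue(
    description="Poem appears twice under variant author spellings",
    responsible="corpus curator",
    approach="Compared text sequences automatically, kept one copy",
)
print(log_to_csv([issue]))
```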

Rarely is data enrichment an isolated activity. Every data enrichment task needs to be

repeatable, i.e., consistently produce the desired outcomes,
rule-driven, so that researchers can run it again and be certain of the same result each time, and
evaluable against a precise standard.
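The three requirements can be illustrated with a small sketch: enrichment is expressed as a fixed, ordered list of deterministic rules, and the result is evaluated against a hand-checked sample. The rules, field names, and the accuracy measure below are hypothetical examples rather than a recommended standard; in particular, the author normalization is deliberately crude.

```python
from typing import Callable

Record = dict[str, str]
Rule = Callable[[Record], Record]


def normalize_author(record: Record) -> Record:
    """Rule: collapse whitespace and title-case the author name (crude)."""
    out = dict(record)
    out["author"] = " ".join(record.get("author", "").split()).title()
    return out


def add_century(record: Record) -> Record:
    """Rule: derive a 'century' attribute from a numeric 'year'."""
    out = dict(record)
    year = record.get("year", "")
    if year.isdigit():
        out["century"] = str((int(year) - 1) // 100 + 1)
    return out


# Rule-driven: the ordered rule list fully determines the outcome.
RULES: list[Rule] = [normalize_author, add_century]


def enrich(record: Record, rules: list[Rule] = RULES) -> Record:
    """Apply every rule in order; the input record is left unchanged."""
    for rule in rules:
        record = rule(record)
    return record


def accuracy(records: list[Record], gold: list[Record], key: str) -> float:
    """Evaluation: share of records whose enriched value matches the gold value."""
    if not gold:
        return 0.0
    hits = sum(enrich(r).get(key) == g.get(key) for r, g in zip(records, gold))
    return hits / len(gold)


raw = {"author": "  adam   mickiewicz ", "year": "1834"}
print(enrich(raw))
print(enrich(raw) == enrich(raw))  # repeatable: same input, same output
```

Because each rule returns a new record instead of modifying its input, the procedure can be rerun on the original data at any time.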

Take-home message: Data quality issues as well as repeatable, rule-driven, and evaluable enrichment procedures should be recorded in a special document, an extension of your data collection / creation log.

FAIR-ification

Finally, data enrichment and preparation may simply mean the FAIR-ification of data, i.e., making your data more FAIR. There are tools that support the principles of open science, FAIR, and CARE, for example OpenRefine (https://openrefine.org/), an open-source program for handling messy data that can be used to clean up data and convert their formats, i.e. to deal with the issues listed in the section Improving Quality of Data.

Openly available automated tools for evaluating FAIR metrics help researchers assess whether the data they work with and publish comply with the FAIR Principles. They include FAIRAware, SATIFYD, the FAIR Evaluator, and FAIR-Self-Check by NFDI4Culture.

Take-home message: There are free and open-access tools that facilitate data enrichment and refinement before one embarks on sharing and linking data.

References

Allen, Mark, and Dalton Cervo. 2015. “Data Quality Management.” In Multi-Domain Master Data Management, 131–60. Elsevier. https://doi.org/10.1016/B978-0-12-800835-5.00009-9.

Knapp, Eric D., and Joel Thomas Langill. 2015. “Exception, Anomaly, and Threat Detection.” In Industrial Network Security, 323–50. Elsevier. https://doi.org/10.1016/B978-0-12-420114-9.00011-3.