Exploring and Analysing Data
Author: Michał Mrugalski
The act of analyzing or evaluating data can be characterized as validating hypotheses on corpora in relation to a population or, in more exploratory approaches, as testing preliminary insights into data patterns, e.g. observing how corpus elements cluster with regard to some predetermined features. Analysis is done with an eye toward a result: it aims at extracting relevant information from data. As a result of the analysis, data should therefore become interpretable information in a human-readable form, such as a report, table, or visualization. Or, to put it differently, computing turns data into information that is interesting and intelligible to humans.
In CLS, various analysis techniques are employed, frequently combined into a single project. These are:
- manual (human) data analysis: in CLS such an analysis is typically performed by a qualified team that annotates texts to provide benchmark results for Machine Learning algorithms (for example, a group of previously instructed students decides whether a passage that is part of the training data is narrative or non-narrative in character, so that an algorithm can learn to classify texts accordingly)
- electronic data analysis: software that filters and reorganizes data, typically in the form of program packages for Machine Learning or, specifically, Natural Language Processing
- generative AI data analysis: Artificial Intelligence and Machine Learning methods used to classify and arrange data.
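The interplay between manual annotation and machine analysis can be illustrated with a minimal sketch: several annotators label passages as narrative or non-narrative, and a majority vote yields the gold-standard labels against which an algorithm could later be trained and evaluated. All passage identifiers and labels below are invented for illustration.

```python
from collections import Counter

# Hypothetical annotations: three instructed annotators label each passage
# as "narrative" or "non-narrative" (invented example data).
annotations = {
    "passage_1": ["narrative", "narrative", "non-narrative"],
    "passage_2": ["non-narrative", "non-narrative", "non-narrative"],
    "passage_3": ["narrative", "narrative", "narrative"],
}

def majority_label(labels):
    """Return the label chosen by most annotators (the 'gold' label)."""
    return Counter(labels).most_common(1)[0][0]

# Gold-standard benchmark: one label per passage, derived by majority vote.
gold_standard = {pid: majority_label(labs) for pid, labs in annotations.items()}
print(gold_standard)
```

In real projects, disagreement among annotators would also be quantified (e.g. with an inter-annotator agreement measure) before the labels are accepted as a benchmark.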
Machine Learning and Natural Language Processing, the latter of which can be regarded as an application of ML to language comprehension and text generation, are the main sources of research procedures for CLS. These procedures include Optical Character Recognition (OCR) and text analysis (also known as text mining or content analysis). Text analysis can be categorized according to the language levels it considers:
- morphological analysis, which, however interesting it is in itself for linguistics, CLS relegates to the pre-processing stage, since it encompasses lemmatization, part-of-speech tagging, and other procedures regarded as preparatory (see Preparing and Enriching Data),
- syntactic analysis or
- lexical semantics.
The last of these in particular enjoys great popularity among CLS researchers, as it includes Word Embeddings, which capture inter-word relationships; Named Entity Recognition (NER), a prerequisite for automated network analysis and geographical analysis of literature; Sentiment Analysis; Topic Modelling; and Terminology Extraction. Higher levels include relational and discourse semantics as well as such advanced NLP applications as summarization and translation.
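One of the lexical-semantic techniques named above, terminology extraction, can be sketched in a minimal, frequency-based form: words that occur markedly more often in a target text than in a background text are treated as candidate terms. The texts and the threshold below are invented for illustration; real CLS work would rely on dedicated NLP libraries and proper statistical keyness measures rather than raw counts.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-zäöüß]+", text.lower())

# Invented miniature target text and background text, for illustration only.
target = "The narrator frames the narrative; the narrative voice shifts."
background = "The house stood on the hill. The voice of the wind was low."

target_counts = Counter(tokenize(target))
background_counts = Counter(tokenize(background))

# A word is a candidate 'term' if it recurs in the target text and is
# more frequent there than in the background (a crude keyness criterion).
terms = [
    w for w, c in target_counts.items()
    if c >= 2 and c > background_counts.get(w, 0)
]
print(terms)  # → ['narrative']
```

Note how the frequent but unspecific word "the" is filtered out by the background comparison, while the text-specific "narrative" survives.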
Whatever procedure one adopts, it is vital to keep a laboratory log, noting any manipulation of the data alongside the reason for the manipulation, its author, and the time at which it happened. “Changing information in itself is not a problem. Difficulties arise when these modifications have not been documented or traced” (Desquilbet et al. 2019, 54).
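Such a laboratory log can be as simple as a structured list of entries, each recording the operation, its reason, its author, and a timestamp. The following sketch uses only the Python standard library; the operations and the author name are hypothetical placeholders, and in practice the log might live in a version-controlled file or a database.

```python
from datetime import datetime, timezone

# A minimal laboratory log: every manipulation of the data is recorded
# together with its reason, its author, and a timestamp.
lab_log = []

def log_step(operation, reason, author):
    """Append one documented data manipulation to the log."""
    lab_log.append({
        "operation": operation,
        "reason": reason,
        "author": author,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })

# Hypothetical example entries.
log_step("lemmatized corpus", "normalize word forms for counting", "researcher_1")
log_step("removed duplicate texts", "avoid skewing frequency counts", "researcher_1")

for entry in lab_log:
    print(entry["timestamp"], entry["author"], entry["operation"], "-", entry["reason"])
```

Because each entry carries its own timestamp and author, the log answers exactly the questions that undocumented modifications leave open: what was changed, by whom, when, and why.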
Take-home message: analysis results in information extracted from data and presented in a human-comprehensible form. Needless to say, all procedures that data undergoes should be recorded.