Exploring and Analysing Data

Author: Michał Mrugalski

Analyzing or evaluating data can be characterized as validating hypotheses on corpora in relation to a population or, in more exploratory approaches, as testing preliminary insights into data patterns, e.g. observing how corpus elements cluster with regard to some predetermined features. Analysis is done with an eye toward a result: it aims at extracting relevant information from data. As a result of the analysis, data should therefore become interpretable information in a human-readable form, such as a report, table, or visualization. To put it differently, computing turns data into information that is interesting and intelligible for humans.
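The exploratory case mentioned above can be sketched in a few lines of Python. The snippet describes each text of a toy corpus by the relative frequency of a few predetermined feature words and reports, for each text, its closest neighbour. The corpus, the feature list, and all function names are invented for illustration; this is only a sketch under those assumptions, and a real project would rely on a proper tokenizer and an established clustering method.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus; a real project would load full texts from files.
CORPUS = {
    "text_a": "the sea and the ship and the storm",
    "text_b": "the ship and the sea and the wind",
    "text_c": "she said that she thought that it was late",
}

# Predetermined features: here, an invented list of function words.
FEATURES = ["the", "and", "she", "that"]


def feature_vector(text, features=FEATURES):
    """Return the relative frequency of each feature word in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty texts
    return [counts[f] / total for f in features]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_neighbours(corpus):
    """Map each text name to the name of its closest other text."""
    vectors = {name: feature_vector(text) for name, text in corpus.items()}
    result = {}
    for name, vec in vectors.items():
        others = [
            (euclidean(vec, other_vec), other)
            for other, other_vec in vectors.items()
            if other != name
        ]
        result[name] = min(others)[1]  # smallest distance wins
    return result


if __name__ == "__main__":
    # Human-readable output: one line per text and its nearest neighbour.
    for name, neighbour in nearest_neighbours(CORPUS).items():
        print(f"{name}\tclosest to\t{neighbour}")
```

The point of the sketch is the last step: the raw counts only become information once they are reported in a form a reader can interpret, here a small table of nearest neighbours.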

In CLS, various analysis techniques are employed, frequently combined into a single project. These are:

Machine Learning (ML) and Natural Language Processing (NLP), the latter of which can be regarded as an application of ML to language comprehension and text generation, are the main sources of research procedures for CLS. This applies to Optical Character Recognition (OCR) and to text analysis (also known as text mining or content analysis). Text analysis can be categorized based on the language levels it considers:

The latter in particular enjoys great popularity among CLS researchers, as it includes Word Embeddings, which capture inter-word relationships; Named Entity Recognition (NER), a prerequisite for automated network analysis and geographical analysis of literature; Sentiment Analysis; Topic Modelling; and Terminology Extraction. Higher levels include relational and discourse semantics and such advanced NLP applications as summarization and translation.

Whatever procedure one adopts, it is vital to keep a laboratory log, noting any manipulation of data alongside a reason for said manipulation, its author, and the point in time it happened. “Changing information in itself is not a problem. Difficulties arise when these modifications have not been documented or traced” (Desquilbet et al. 2019, 54).
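One possible way to keep such a log is sketched below in Python: each manipulation is appended as one line to a JSON Lines file, together with its reason, its author, and a timestamp. The file name `lab_log.jsonl`, the field names, and the functions `log_manipulation` and `read_log` are assumptions made for this illustration, not a prescribed format.

```python
import json
from datetime import datetime, timezone


def log_manipulation(log_path, action, reason, author):
    """Append one entry to a JSON Lines log file and return the entry."""
    entry = {
        # Point in time of the manipulation, recorded in UTC.
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "author": author,
        "action": action,
        "reason": reason,
    }
    # Mode "a" appends, so earlier entries are never overwritten.
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return entry


def read_log(log_path):
    """Return all logged entries in the order they were written."""
    with open(log_path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


if __name__ == "__main__":
    # Hypothetical example entry; names and wording are invented.
    log_manipulation(
        "lab_log.jsonl",
        action="Removed page headers from the OCR output",
        reason="Headers distort word frequencies",
        author="A. Researcher",
    )
    for entry in read_log("lab_log.jsonl"):
        print(entry["timestamp"], entry["author"], entry["action"], sep=" | ")
```

An append-only plain-text file is a deliberately simple choice: it can be read without special software and kept under version control alongside the data.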

Take-home message: analysis results in information extracted from data and presented in a human-comprehensible form. Every procedure applied to the data should be recorded.

References

Desquilbet, Loic, Sabrina Granger, Boris Hejblum, Arnaud Legrand, Pascal Pernot, Nicolas P. Rougier, Elisa de Castro Guerra, et al. 2019. Vers Une Recherche Reproductible. Edited by Unité régionale de formation à l’information scientifique et technique de Bordeaux. https://hal.science/hal-02144142.