Planning and Designing Data
Author: Carolin Odebrecht
Firstly, CLS data typically require a careful corpus design that reflects, for example, the preservation conditions of sources, the identification of their genre (see CLS INFRA’s Survey of Methods, Schöch, Dudar, and Fileva 2023) and register (Biber and Conrad 2019), authorship attribution (Schöch, Dudar, and Fileva 2023), and the evaluation or classification of textual material with regard to, e.g., cultural, social, and literary contexts, traditions, and canonicity in specific CLS domains. In this section, we present three types of corpus design: Designed Corpus, Opportunistic/Growing/Living/Dynamic Corpus, and Digital Edition.
Secondly, designing the preparation and especially the annotation of CLS data appears to be particularly demanding (see the section Preparing and Enriching Data).
Thirdly, the availability of resources and their terms of use must be clarified at the initial stage of planning and designing (see the section Preserving and Publishing Data).
CLS data in general require an intensive corpus design, including but not limited to:
These design criteria operationalise the decisions made by the corpus creators in relation to the corpus-internal and corpus-external aspects of the literary data, e.g., authorship, work, genre, period, bibliographical metadata, publication history and availability, text length, and language. The following examples illustrate how these criteria might be implemented. They are, however, neither a representative nor a normative selection.
Designed Corpus
The European Literary Text Collection ELTeC (Odebrecht, Burnard, and Schöch 2021) can serve as an example of a careful corpus design in the domain of CLS and is one of the key deliverables of the COST Action ‘Distant Reading for European Literary History’ (CA16204). ELTeC is a designed collection of literary texts and serves as a benchmark corpus for computational methods in CLS. “Its composition is determined not by the happenstance of whatever we can get our hands on, but is instead defensible, at least in theory, as a principled and representative selection.” (Burnard, Schöch, and Odebrecht 2021).
This corpus design follows a metadata-based approach that takes potential novel candidates into account but does not determine in advance what a novel might be from a European literary perspective. Furthermore, as the population of European novels cannot be defined (yet), i.e., no extensive bibliography of the European novel exists, the corpus design focuses on representing variation rather than the population.
For collaborative projects and large data sets, corpus monitoring might be applicable. ELTeC is monitored on the basis of each text’s metadata: ELTeC-Monitoring.
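Metadata-based monitoring of this kind can be illustrated with a minimal sketch. It assumes that each text is described by a small metadata record; the field names (`period`, `size`) and all values are hypothetical and are not taken from the ELTeC metadata scheme or its monitoring pages. The sketch only shows the general idea of tallying texts per metadata category in order to see how the corpus is composed.

```python
from collections import Counter

# Hypothetical metadata records; the field names and values are
# illustrative and are not taken from the ELTeC metadata scheme.
texts = [
    {"id": "t001", "period": "1840-1859", "size": "short"},
    {"id": "t002", "period": "1840-1859", "size": "long"},
    {"id": "t003", "period": "1860-1879", "size": "short"},
    {"id": "t004", "period": "1880-1899", "size": "medium"},
]


def composition(records, field):
    """Return (count, share) per value of one metadata field."""
    counts = Counter(record[field] for record in records)
    total = sum(counts.values())
    return {value: (n, n / total) for value, n in counts.items()}


def cross_tab(records, row_field, col_field):
    """Count texts per combination of two metadata fields."""
    return Counter((r[row_field], r[col_field]) for r in records)


# Print a simple overview of the corpus composition by period.
for value, (n, share) in sorted(composition(texts, "period").items()):
    print(f"{value}: {n} texts ({share:.0%})")
```

A cross-tabulation such as `cross_tab(texts, "period", "size")` can reveal empty or thinly populated combinations of criteria, which is typically what a monitoring view is meant to expose.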
The collection of Eighteenth-Century French Novels is another good example of a designed corpus that uses balancing criteria (Röttgermann 2023). “The collection is created in the context of Mining and Modeling Text (2019-2023), a project which is located at the Trier Center for Digital Humanities (TCDH) at Trier University.” (cf. Homepage).
The Bibliographie du genre romanesque français (BGRF) enables data creators to design the corpus with respect to the population; they are thus able to model the corpus’s balancing criteria with reference to this population.
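When a bibliography documents the population, balancing can be checked by comparing the share of each category in the corpus with its share in the population. The following is a minimal sketch under stated assumptions: the categories (decades of first publication) and all counts are invented for illustration and do not come from the BGRF or from the collection itself, and `balance_report` is a hypothetical helper name.

```python
def balance_report(corpus_counts, population_counts):
    """Return corpus share minus population share per category.

    Positive values mean the category is over-represented in the
    corpus, negative values mean it is under-represented.
    """
    corpus_total = sum(corpus_counts.values())
    population_total = sum(population_counts.values())
    if corpus_total == 0 or population_total == 0:
        raise ValueError("both corpus and population must be non-empty")

    report = {}
    for category, population_n in population_counts.items():
        population_share = population_n / population_total
        corpus_share = corpus_counts.get(category, 0) / corpus_total
        report[category] = round(corpus_share - population_share, 3)
    return report


# Invented counts of novels per decade of first publication.
population = {"1751-1760": 300, "1761-1770": 600, "1771-1780": 100}
corpus = {"1751-1760": 30, "1761-1770": 50, "1771-1780": 20}

for category, difference in balance_report(corpus, population).items():
    print(f"{category}: {difference:+.3f}")
```

Whether a given deviation is acceptable is a design decision for the corpus creators; the sketch only makes the deviation visible.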
For collaborative projects and large data sets, corpus monitoring may be applicable. This collection is monitored on the basis of each text’s metadata: Eighteenth-Century French Novel-Monitoring.
Opportunistic/Growing/Living/Dynamic Corpus
The Drama Corpora Project DraCor (Fischer et al. 2019) contains an ever-growing collection of plays in different languages and provides an Open Access platform for displaying, analysing, and downloading drama data.
The corpus design focuses on the genre of drama without any pre-determined sub-genre and/or subject classifications that could shape the corpus composition. Growing corpora have the advantage over scoped/designed ones that new data can be added continuously without having to check for, e.g., balance or composition criteria. DraCor lays the foundation for research projects that, for instance, compile data from different sources into a designed corpus or that focus on genre-specific research in either mono- or multilingual settings.
The Diachronic Spanish Sonnet Corpus DISCO (Ruiz Fabo and Bermúdez Sabel 2023) is another example of a growing corpus; it focuses on sonnets in Spanish from the 15th to the early-to-mid 20th century.
With the help of corpus monitoring, the composition of the corpus can be explored. For example, it is stated that “Although overall in the corpus we deliberately included less canonical writers, less than 10% of the authors are female. An active search will be carried out to counteract this lack of diversity.” (DISCO Corpus monitoring).
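A statement about the share of authors, as opposed to the share of texts, requires counting each author only once. The sketch below illustrates this distinction; the records, the field names (`author`, `gender`) and the helper `author_share` are hypothetical and do not reflect the DISCO data or its monitoring implementation.

```python
from collections import Counter

# Hypothetical sonnet records; one author contributes several texts.
sonnets = [
    {"id": "s1", "author": "Author A", "gender": "male"},
    {"id": "s2", "author": "Author A", "gender": "male"},
    {"id": "s3", "author": "Author A", "gender": "male"},
    {"id": "s4", "author": "Author B", "gender": "female"},
    {"id": "s5", "author": "Author C", "gender": "male"},
]


def author_share(records, field):
    """Return the share of distinct authors per value of `field`."""
    # Deduplicate by author so that prolific authors count once.
    value_by_author = {r["author"]: r[field] for r in records}
    counts = Counter(value_by_author.values())
    total = len(value_by_author)
    return {value: n / total for value, n in counts.items()}


for value, share in author_share(sonnets, "gender").items():
    print(f"{value}: {share:.0%} of authors")
```

In this invented example, one of three authors is female, although only one of five texts was written by a woman; the two figures answer different questions about diversity in the corpus.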
Digital Edition
The Faust-Edition (Bohnenkamp, Henke, and Jannidis n.d.) is an example of a digital edition with a common design focus. It collects and presents the manuscripts and the text-critically relevant printings of Faust that were published during Goethe’s lifetime in order to make the analysis of the work’s genesis possible. Central to this digital edition is that the textual variants can be analysed individually, in the context of other variants, and in the context of the entire genesis of the work.