Planning and Designing Data

Author: Carolin Odebrecht

Firstly, CLS data typically require a careful corpus design that reflects, for example, the preservation conditions of sources, their genre (see CLS INFRA’s Survey of Methods, Schöch, Dudar, and Fileva 2023) and register (Biber and Conrad 2019) identification, authorship attribution (Schöch, Dudar, and Fileva 2023), and the evaluation or classification of textual material with regard to, e.g., cultural, social, and literary contexts, traditions, and canonicity in specific CLS domains. In this section, we present three types of corpus design: Designed Corpus, Opportunistic/Growing/Dynamic Corpus, and Edition.

Secondly, designing how CLS data are prepared, and especially how they are annotated, appears to be particularly demanding (see Section Preparing and Enriching Data).

Thirdly, the availability of resources and their terms of use must be clarified at the initial stage of planning and designing (see Section Preserving and Publishing Data).

CLS data in general requires an intensive corpus design including but not limited to:

These design criteria operationalise the decisions made by the corpus creators in relation to the corpus-internal and corpus-external aspects of the literary data, e.g., authorship, work, genre, period, bibliographical metadata, publication history and availability, text length, and language. The following examples illustrate how these criteria might be implemented. They are, however, neither a representative nor a normative selection.

Designed corpus

The European Literary Text Collection (ELTeC) (Odebrecht, Burnard, and Schöch 2021) can serve as an example of careful corpus design in the domain of CLS; it is one of the key deliverables of the COST Action ‘Distant Reading for European Literary History’ (CA16204). ELTeC is a designed collection of literary texts and serves as a benchmark corpus for computational methods in CLS. “Its composition is determined not by the happenstance of whatever we can get our hands on, but is instead defensible, at least in theory, as a principled and representative selection.” (Burnard, Schöch, and Odebrecht 2021).

This corpus design takes a metadata-based approach that reflects potential novel candidates but does not determine in advance what a novel might be from a European literary perspective. Furthermore, as the population of European novels cannot (yet) be defined, i.e., no extensive bibliography of the European novel exists, the corpus design focuses on representing variation rather than the population.

For collaborative projects and large data sets, corpus monitoring might be applicable. ELTeC is monitored on the basis of each text’s metadata: ELTeC-Monitoring.
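Metadata-based monitoring of this kind can be sketched in a few lines of Python. The following is a minimal illustration only, not ELTeC's actual monitoring code: the records, the field names (`period`, `author_gender`, `length`) and their values are hypothetical placeholders chosen for this example.

```python
from collections import Counter

# Hypothetical metadata records; field names and values are illustrative only.
texts = [
    {"id": "T001", "period": "1840-1859", "author_gender": "F", "length": "short"},
    {"id": "T002", "period": "1840-1859", "author_gender": "M", "length": "long"},
    {"id": "T003", "period": "1860-1879", "author_gender": "M", "length": "medium"},
    {"id": "T004", "period": "1880-1899", "author_gender": "F", "length": "long"},
]


def monitor(records, criterion):
    """Return the share of texts per value of one metadata criterion."""
    counts = Counter(record[criterion] for record in records)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


# Report the current composition of the corpus for each criterion.
for criterion in ("period", "author_gender", "length"):
    print(criterion, monitor(texts, criterion))
```

Re-running such a report whenever texts are added lets collaborators see at a glance which categories are still thin.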

The collection of Eighteenth-Century French Novels is another good example of a designed corpus that uses balancing criteria (Röttgermann 2023). “The collection is created in the context of Mining and Modeling Text (2019-2023), a project which is located at the Trier Center for Digital Humanities (TCDH) at Trier University.” (cf. Homepage).

The Bibliographie du genre romanesque français (BGRF) enables data creators to design the corpus with respect to the population; they are thus able to model the corpus’s balancing criteria with reference to this population.
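Where a population is known from a bibliography, balance can be expressed as the difference between each category's share in the corpus and its share in the population. The sketch below assumes invented counts of novels per decade; it reproduces neither BGRF figures nor the project's own procedure, and the function names `shares` and `balance_gaps` are hypothetical.

```python
def shares(counts):
    """Convert absolute counts per category into relative shares."""
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def balance_gaps(corpus_counts, population_counts):
    """Corpus share minus population share for every population category.

    Positive values mean the category is over-represented in the corpus,
    negative values mean it is under-represented.
    """
    corpus = shares(corpus_counts)
    population = shares(population_counts)
    return {
        category: corpus.get(category, 0.0) - population[category]
        for category in population
    }


# Hypothetical counts of novels per decade (invented numbers).
population_counts = {"1751-1760": 200, "1761-1770": 300, "1771-1780": 500}
corpus_counts = {"1751-1760": 10, "1761-1770": 30, "1771-1780": 60}

for decade, gap in balance_gaps(corpus_counts, population_counts).items():
    print(f"{decade}: {gap:+.2f}")
```

How large a gap is still acceptable remains a project-specific design decision.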

For collaborative projects and large data sets, corpus monitoring may be applicable. This collection is monitored on the basis of each text’s metadata: Eighteenth-Century French Novel-Monitoring.

Opportunistic/Growing/Living/Dynamic Corpus

The Drama Corpora Project DraCor (Fischer et al. 2019) contains an ever-growing collection of plays in different languages and provides an Open Access platform for displaying, analysing, and downloading drama data.

The corpus design focuses on the genre of drama without any pre-determined sub-genre and/or subject classifications that could shape the corpus composition. Growing corpora have the advantage over scoped/designed ones that new data can be added continuously without having to check, e.g., balance or composition criteria. DraCor lays the foundation for research projects that, for instance, compile data from different sources into a designed corpus or that focus on genre-specific research in either mono- or multilingual settings.

The Diachronic Spanish Sonnet Corpus DISCO (Ruiz Fabo and Bermúdez Sabel 2023) is another example of a growing corpus; it focuses on sonnets in Spanish from the 15th to the early-to-mid 20th century.

With the help of corpus monitoring, the composition of the corpus can be explored. For example, the DISCO Corpus monitoring states: “Although overall in the corpus we deliberately included less canonical writers, less than 10% of the authors are female. An active search will be carried out to counteract this lack of diversity.”

Digital Edition

The Faust-Edition (Bohnenkamp, Henke, and Jannidis n.d.) is an example of a digital edition with a common design focus. It collects and presents the manuscripts and the text-critically relevant printings of Faust that were published during Goethe’s lifetime in order to make the analysis of the work’s genesis possible. Central to this digital edition is that the textual variants can be analysed individually, in the context of other variants, and in relation to the entire genesis of the work.

References

Biber, Douglas, and Susan Conrad. 2019. Register, Genre, and Style. 2nd ed. Cambridge University Press. https://doi.org/10.1017/9781108686136.
Bohnenkamp, Anne, Silke Henke, and Fotis Jannidis. n.d. “Faust. Historisch-Kritische Edition.” Digital Edition. Frankfurt am Main / Weimar / Würzburg. Accessed March 7, 2024. https://faustedition.net/.
Burnard, Lou, Christof Schöch, and Carolin Odebrecht. 2021. “In Search of Comity: TEI for Distant Reading.” Journal of the Text Encoding Initiative, no. 14 (March). https://doi.org/10.4000/jtei.3500.
Fischer, Frank, Ingo Börner, Mathias Göbel, Angelika Hechtl, Christopher Kittel, Carsten Milling, and Peer Trilcke. 2019. “Programmable Corpora: Introducing DraCor, an Infrastructure for the Research on European Drama,” July. https://doi.org/10.5281/ZENODO.4284002.
Odebrecht, Carolin, Lou Burnard, and Christof Schöch. 2021. “European Literary Text Collection (ELTeC): April 2021 Release with 14 Collections of at Least 50 Novels.” Zenodo. https://doi.org/10.5281/ZENODO.4662444.
Röttgermann, Julia. 2023. “Collection de Romans Français Du Dix-Huitième Siècle (1751-1800) / Collection of Eighteenth Century French Novels 1751-1800.” Zenodo. https://doi.org/10.5281/ZENODO.10404966.
Ruiz Fabo, Pablo, and Helena Bermúdez Sabel. 2023. “Pruizf/Disco: Version 5.0.” Zenodo. https://doi.org/10.5281/ZENODO.1012567.
Schöch, Christof, Julia Dudar, and Evgeniia Fileva. 2023. “CLS INFRA D3.2: Series of Five Short Survey Papers on Methodological Issues (= Survey of Methods in Computational Literary Studies).” Zenodo. https://doi.org/10.5281/ZENODO.7892112.