Creating and Collecting Data
Author: Michał Mrugalski
The Notion of Data and their Collection / Creation
Nothing else has the same impact on research findings in CLS as collecting / creating data according to a corpus design. At this stage, researchers can control a wide range of variables, and their manipulation of these variables ultimately results in varied outcomes.
To foster openness and reproducibility of research, it is recommended that CLS researchers regard the process of collecting and creating data (corpora) from the outset as producing an artifact that can be reused by other researchers (Harrower 2020, 9).
But why the slash symbol between “collecting” and “creating” data? Why the insistence that “data” in CLS is a function of a “corpus” design?
To quote a well-received essay collection, “raw data is an oxymoron” (Gitelman 2013). The process of “collecting” data is not “harvesting” “raw” material, but in fact always happens with a view to an envisioned corpus; a corpus design predetermines what one looks for. “Data” is something that exists in a corpus and only as a function of a corpus can be an object of study in the framework of CLS. And the other way round: the creation of a CLS dataset inevitably entails reusing pre-existing data.
The first step in any CLS research project thus amounts to explicitly defining what one understands as “data”, especially since Literary Studies abstains from using the very term. As Jennifer Edmond and Erzsébet Tóth-Czifra put it, literary scholars “resist the blanket term ‘data’ for the very good reason that we have more and precise terminology (e.g. primary sources, secondary sources, theoretical documents, bibliographies, critical editions, annotations, notes, etc.) available to us to describe and make transparent our research processes” (Edmond and Tóth-Czifra 2018, 1).
But since in computer science “data” is the counterpart of “computing”, it is hard to imagine CLS without a notion of data, not to mention that a good portion of CLS research consists of applying the procedures of “data science” to corpora of literary texts. Of course, these procedures are adapted to issues of interest for literary or cultural scholars, such as narrative point of view (Piper and Bagga 2022; Piper and Toubia 2023).
The authors of the ALLEA group report (Harrower 2020, 8) propose to define “data in the humanities broadly as all materials and assets scholars collect, generate and use during all stages of the research cycle.” Some of these data have “digital form” and as such they are the object of CLS. By “digital” we mean their quality of being machine manipulable, i.e. primarily countable and enrichable (they can be annotated or commented on, see Preparing and Enriching Data). Being discrete, they can also easily be subjected to reduction, as with the removal of stop words.
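As a minimal sketch of what “countable” and “reducible” can mean in practice, the following Python snippet counts the tokens of a sample sentence with and without stop words. The tokenizer, the tiny stop-word list, and the function names are assumptions made for illustration only; an actual project would document its own tokenization rules and stop-word list in its protocol.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "a", "of", "and", "was", "it"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def count_tokens(text, stop_words=frozenset()):
    """Count tokens, skipping any that appear in stop_words."""
    return Counter(t for t in tokenize(text) if t not in stop_words)

sample = "It was the best of times, it was the worst of times."

# Counting: every token is a discrete, countable unit.
full_counts = count_tokens(sample)

# Reduction: the same text with stop words removed.
reduced_counts = count_tokens(sample, STOP_WORDS)
```

The choice of stop-word list changes the resulting counts, which is one reason to record such decisions alongside the data.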
With reproducibility in mind and in keeping with the FAIR principles, we cannot think of data in isolation from an identifier and metadata, which together allow for the identification of the dataset and its provenance, alongside the enrichments the data went through in the process of collection / creation, subsequent pre-processing, and analysis. The authors of the ALLEA group report (Harrower 2020, 17) deem it “central to the realization of FAIR” that we use “the concept of a FAIR digital object – an elemental ‘bundle’ that includes the research data, the persistent identifier (PID), and metadata rich enough to enable them to be reliably found, used and cited.” In short, our research corpus is a bundle consisting of data, PID, and metadata.
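The idea of a bundle of data, PID, and metadata can be sketched as a small data structure. The class name `CorpusBundle`, its fields, and the placeholder identifier below are hypothetical and not taken from the report; the sketch only illustrates that the three parts travel together and can be checked for completeness.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class CorpusBundle:
    """Illustrative bundle of data, persistent identifier, and metadata."""
    pid: str                  # persistent identifier, e.g. a DOI
    data_files: list          # paths to the files that make up the corpus
    metadata: dict = field(default_factory=dict)

    def missing_parts(self):
        """Return the names of bundle parts that are still empty."""
        return [name for name, value in asdict(self).items() if not value]

bundle = CorpusBundle(
    pid="doi:10.0000/example",  # placeholder, not a real DOI
    data_files=["texts/novel_001.txt"],
    metadata={"title": "Example corpus", "creator": "A. Researcher"},
)

# A serialized manifest that could accompany the corpus files.
manifest = json.dumps(asdict(bundle), indent=2, sort_keys=True)
```

Calling `bundle.missing_parts()` before publication is one simple way to notice that an identifier or metadata record has not yet been supplied.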
The explicit definition of (our) data should be made part of a protocol of data collection / creation. A central aspect of reproducibility-centered open science is documenting the process of data collection and creation in the form of a protocol, where a researcher describes how they obtained each piece of data (Desquilbet et al. 2019, 53, see also 32). The protocol should contain as much information as possible, as “at the beginning of a project it is difficult to judge which information of the research process will be important and valuable later on” (Harrower 2020, 9) and to other researchers. Qualities of research objects change when you change the method of observation; therefore, both your harvesting method and your protocol should be as standardized as possible (Desquilbet et al. 2019, 25–26).
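One possible way to keep such a protocol in a standardized form is an append-only log with one structured entry per acquired text. The following is a sketch under stated assumptions: the function `log_acquisition`, the entry fields, and the source label are hypothetical, and the in-memory buffer stands in for a log file opened in append mode.

```python
import hashlib
import io
import json
from datetime import datetime, timezone

def log_acquisition(log, text, source, method, note=""):
    """Append one protocol entry describing how a text was obtained."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source": source,    # where the text came from
        "method": method,    # e.g. download, OCR, manual transcription
        # A checksum makes later changes to the text detectable.
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "note": note,
    }
    # One JSON object per line ("JSON Lines") keeps the log easy to parse.
    log.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return entry

protocol = io.StringIO()  # stands in for a file opened in append mode
entry = log_acquisition(
    protocol,
    "Call me Ishmael.",
    source="hypothetical-archive/item-42",  # illustrative label only
    method="manual transcription",
    note="first sentence only",
)
```

Because every entry has the same fields, the log can later be turned into a table for a data paper with little extra effort.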
Ideally, the information contained in the collection protocol should be part of the published article reporting the results of your research (Desquilbet et al. 2019, 53). In the case of CLS, it may mean that one documents the process of data collection and creation with the intention of publishing a data paper, either in a public repository or in a journal dedicated to the publication of open data. A list of such journals includes (Tóth-Czifra et al. 2023, 76):
- Research Data Journal for the Humanities and Social Sciences
- Journal of Open Humanities Data
- Journal of Cultural Analytics
- Journal of the Text Encoding Initiative
- RIDE – A Review Journal for Digital Editions and Resources
Die Zeitschrift für digitale Geisteswissenschaften likewise welcomes data publications. Some of those journals, particularly the Journal of Open Humanities Data, publish templates for the assessment of corpora. The criteria contained in these templates can serve researchers as a guide when preparing their own corpora and documenting the protocols by which those corpora were produced.
Take-home message: in the spirit of FAIR-ness, researchers should explicitly define their data with a view to their corpus design and research intention and log all collecting decisions, ideally with a view to publishing not only their data, but also a data paper expanding on the rationale for their choices.
Creating Data by Collecting Them in Corpora
The object of study in CLS is not a single text, but a collection of texts or a corpus (Gavin 2022, chap. 1). The slash between creation and collection means that CLS researchers basically create data by collecting textual materials according to their general corpus design.
As discussed in Planning and Designing Data, the eligibility of texts and the texts’ relations to one another within the corpus provide an additional order that turns textual noise into data, loaded with potential information for analysis to extract. Should the corpus consist of whole “works” or just their fragments, and, if the latter holds, how should the boundaries of the textual pieces be set? (Most computational methods tokenize texts, turning them into “bags of words”, so, in contradistinction to traditional literary analysis, the analysis of segments of comparable length renders more reliable results than one based on “entire” utterances.) How to get hold of textual material? Can it be found online in a more or less reliable source or should it be digitized? How? By applying Optical Character Recognition or by manually typing in words? How should the OCR process or typing be verified?
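The point about segments of comparable length can be illustrated with a short sketch that cuts a tokenized text into consecutive segments of a fixed number of tokens. The function name `segment_tokens`, the whitespace tokenization, and the default of dropping a shorter final segment are assumptions for illustration; the appropriate segment size and the treatment of the remainder depend on the corpus design.

```python
def segment_tokens(tokens, size, keep_remainder=False):
    """Split a token list into consecutive segments of `size` tokens.

    By default a final segment shorter than `size` is dropped so that
    all segments are of equal length.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    segments = [tokens[i:i + size] for i in range(0, len(tokens), size)]
    if segments and len(segments[-1]) < size and not keep_remainder:
        segments.pop()
    return segments

# Naive whitespace tokenization, sufficient for the illustration.
tokens = "the quick brown fox jumps over the lazy dog again".split()

# Ten tokens cut into segments of four: two full segments, remainder dropped.
segments = segment_tokens(tokens, size=4)
```

Whether the remainder is dropped or kept is itself a corpus-design decision and belongs in the collection protocol.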
Unless we create our data from scratch, the recommended sources for CLS include general-purpose collections (see our deliverable D5.1), such as HathiTrust Digital Library or Wikisource, and data published by researchers in the framework of specific research projects in trusted repositories and perhaps even reviewed in data journals. Trusted repositories for CLS include, alongside GitHub and GitLab pages associated with specific research projects, Humanities Commons, UK Data Archive, and Arche.
A good starting point is a registry of research data repositories, such as re3data or FAIRsharing.org. While browsing humanities datasets, one should take into account whether one’s own dataset “could be published in a similar fashion” (Harrower 2020, 9).
One important source of data in the humanities is GLAM institutions (galleries, libraries, archives, museums). To ensure smooth cooperation with GLAM institutions, it is advisable to rely on a set of guidelines that include models for specific agreements between institutions and researchers, such as Appendix 1 to The Heritage Data Reuse Charter. The Heritage Data Reuse Charter is an implementation in the context of the arts and humanities of the DARIAH Campus Reuse Charter. You can find out more about data reuse in section Reusing Data.
Texts must be encoded in a particular way (modelling or encoding scheme and format) that allows for annotation and manipulation. As for data and encoding formats, it is recommended to use open-license (non-proprietary) formats curated by the World Wide Web Consortium (W3C) (mostly XML, RDF, JSON), especially as they undergo enrichment and concretization under the auspices of international communities, such as the Text Encoding Initiative (TEI), the Music Encoding Initiative (MEI), and the International Image Interoperability Framework (IIIF). Moreover, human- and machine-readable systems are preferable to solely machine-readable ones, as this “provides better sustainability and long-term accessibility” (Harrower 2020, 20).
Metadata captures precisely the information on the eligibility, provenance, format, and modelling of the data. Consistent with the FAIR principles, your metadata should adopt one of the common standards, e.g. Dublin Core, MARC (a library cataloging standard), or the archival standard EAD (Harrower 2020, 17). “A good starting point is to consult the Metadata Standards Directory, a community-maintained directory hosted by the Research Data Alliance” (Harrower 2020, 18).
In some encoding formats, such as XML/TEI, metadata form part of the document; in other cases, they are usually contained in a separate file, for example as a CSV table or a JSON “dictionary”.
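The two separate-file options mentioned here, a CSV table and a JSON “dictionary”, can be sketched with Python's standard library. The records are invented, and the column names are an assumption loosely borrowed from Dublin Core element names; in-memory buffers stand in for files on disk.

```python
import csv
import io
import json

# Invented example records; column names loosely follow Dublin Core elements.
records = [
    {"identifier": "text_001", "title": "Example Novel",
     "creator": "Doe, Jane", "date": "1850", "language": "en"},
    {"identifier": "text_002", "title": "Another Novel",
     "creator": "Roe, Richard", "date": "1861", "language": "en"},
]
fieldnames = ["identifier", "title", "creator", "date", "language"]

# Option 1: a CSV table with one row per text.
csv_buffer = io.StringIO()  # stands in for a file such as metadata.csv
writer = csv.DictWriter(csv_buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(records)

# Option 2: a JSON dictionary keyed by the text identifier.
metadata_by_id = {record["identifier"]: record for record in records}
json_text = json.dumps(metadata_by_id, indent=2, ensure_ascii=False)

# Reading the CSV back yields the same records (all values as strings).
rows = list(csv.DictReader(io.StringIO(csv_buffer.getvalue())))
```

Keying both representations by the same identifier used in the corpus file names keeps the link between each text and its metadata explicit.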
It is important to rely on a controlled vocabulary for choosing your metadata categories. In literary studies, there is no single universally accepted standard, as the interviews with our experts show. The most popular standards, FRBRoo and CIDOC CRM, are very high-level, and it is up to literary researchers to specify them with regard to literary terms. A literary particularization was implemented in an exemplary way in the POSTDATA Network of Ontologies for European Poetry. One can rely on the Library of Congress for subject headings or make use of the catalog of vocabularies (especially Art and Architecture (Getty)) or, even better, Wikidata, because of Wikibase’s great potential to create a basis for knowledge graphs that inscribe our output into the Linked Open Data universe.
To “clean” your metadata, consider using an application such as OpenRefine.
Take-home message: Researchers create textual data by collecting them for the sake of their corpus design, which determines the scope, balance, provenance, format, and encoding of metadata. Of course, all decisions should be documented in a data collection protocol.