Preserving and Publishing Data
Author: Carolin Odebrecht
The preservation and publication of CLS data is a crucial step in the research data lifecycle for data users, because it makes the results of the previous steps available for re-use. Publication enables data creators to release data sets (corpora, collections, editions) in line with good research practice, which requires validation, reproducibility, and citability of research data (cf. FAIR Data and Research Data Lifecycle). Data preservation is typically achieved by publication in a trustworthy repository.
Two aspects are important when publishing corpora: first, the data set as such (Section Data Set), and second, the choice of the publication platform (Section Publication Platform). With the help of a to-do list for data publication (Odebrecht and Biskup 2023), we aim to provide a helpful starting point for CLS data publications.
Data Set
Identify which version of a data set (title, version of the collection/corpus/edition, metadata) is eligible for publication. This data set needs to be completed by a README file containing the following information:
A README is a brief piece of documentation in plain text format (e.g. README.txt) that is directly attached to the data set and contains the explanations and references necessary for data (re-)use.
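As a minimal sketch of how such a README could be generated and checked for completeness, the following Python snippet renders a plain-text README.txt from a dictionary. The field names in `README_FIELDS` and the helper names `render_readme` and `write_readme` are illustrative assumptions; they are not the checklist from Odebrecht and Biskup (2023) and should be replaced by the items of the to-do list that is actually followed.

```python
from pathlib import Path

# Hypothetical field names (an assumption for illustration only);
# replace them with the items required by the checklist you follow.
README_FIELDS = ("Title", "Version", "Creators", "Licence", "Description", "How to cite")


def render_readme(info: dict[str, str]) -> str:
    """Render a plain-text README, failing early if a field is missing or empty."""
    missing = [field for field in README_FIELDS if not info.get(field)]
    if missing:
        raise ValueError(f"README is missing: {', '.join(missing)}")
    return "\n".join(f"{field}: {info[field]}" for field in README_FIELDS) + "\n"


def write_readme(info: dict[str, str], directory: str = ".") -> Path:
    """Write README.txt next to the data set and return its path."""
    path = Path(directory) / "README.txt"
    path.write_text(render_readme(info), encoding="utf-8")
    return path
```

Raising an error for missing fields, rather than writing an incomplete file, keeps the README and the data set version in step before anything is uploaded.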
Publication Platform
A publication platform is a service that stores, preserves and provides access to data sets, including metadata and search / filtering functions via a metadata catalogue. When evaluating publication services, we need to focus on domain-specific criteria related to the research context of the service, its user community and its visibility in the research discipline. The choice of publication platform typically depends on the following general criteria.
Repositories can be found, e.g., by using the following meta search tools:
In general, researchers can choose between generic repositories (e.g. Zenodo), domain-specific repositories (e.g. LAUDATIO-Repository, Textgrid-Repository) and institutional repositories, which are services provided by computing centres and libraries of academic institutions. In this context, Harrower (2020, 12) points out: “Use disciplinary repositories where they exist, as they are more likely to be developed around domain expertise, disciplinary practices and community-based standards, which will promote the findability, accessibility, interoperability and ultimately the reuse and value of your data. The level of curation available in a repository is key to data quality and reusability.”
In terms of quality assurance, the choice of a repository also depends on its interfaces for data harvesting, on standardised metadata that are compliant with the DataCite Metadata Schema used by OpenAIRE, and ideally on additional domain-specific metadata schemas and citation suggestions (Gouzi et al. 2024, 11–12).
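To illustrate what compliance with a standardised metadata schema can mean in practice, the following sketch checks a metadata record for the six mandatory properties of the DataCite Metadata Schema (version 4.x). The key names follow the schema's element names, but the dictionary layout, the helper `missing_mandatory` and the example record are assumptions made for illustration; repositories usually perform this validation themselves on deposit.

```python
# The six mandatory properties of the DataCite Metadata Schema (version 4.x).
DATACITE_MANDATORY = (
    "identifier",
    "creators",
    "titles",
    "publisher",
    "publicationYear",
    "resourceType",
)


def missing_mandatory(record: dict) -> list[str]:
    """Return the mandatory DataCite properties that are absent or empty."""
    return [key for key in DATACITE_MANDATORY if not record.get(key)]


# Invented example record (not a real data set or DOI).
record = {
    "identifier": {"identifier": "10.1234/example", "identifierType": "DOI"},
    "creators": [{"name": "Doe, Jane"}],
    "titles": [{"title": "Example Corpus"}],
    "publisher": "Example Repository",
    "publicationYear": "2024",
}

print(missing_mandatory(record))  # ['resourceType']
```

A check of this kind only covers the presence of properties; it does not validate their content or any domain-specific schema used alongside DataCite.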
Certification instruments for data repositories have been developed, a prominent one being the CoreTrustSeal (CTS). At the same time, there are also data repositories that have no certification but have earned trustworthiness through many years of reliable operation by trustworthy providers and large user bases, like Zenodo.
Data publication allows for different use cases, illustrated by the examples described in this wiki: ELTeC (see Designed corpus) is organised as a community on Zenodo: https://zenodo.org/communities/eltec/records. DISCO (see Opportunistic/Growing/Living/Dynamic Corpus in Corpus Design) is published as an individual data set on Zenodo. DraCor (see Opportunistic/Growing/Living/Dynamic Corpus) aggregates texts from different repositories such as Textgrid (Consortium 2006).
Generic Repository
A commonly used and trusted generic repository is Zenodo, which is hosted by CERN.
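Records published on Zenodo can also be listed programmatically. The following sketch queries Zenodo's REST API for the records of a community, using the ELTeC community mentioned above as an example. The `communities` and `size` query parameters and the `hits.hits[].metadata.title` response layout reflect our understanding of the API and should be checked against the current Zenodo documentation; the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ZENODO_RECORDS_API = "https://zenodo.org/api/records"


def community_query_url(community: str, size: int = 10) -> str:
    """Build the search URL for the records of one Zenodo community."""
    query = urlencode({"communities": community, "size": size})
    return f"{ZENODO_RECORDS_API}?{query}"


def extract_titles(response: dict) -> list[str]:
    """Pull record titles out of a search response (assumed layout)."""
    hits = response.get("hits", {}).get("hits", [])
    return [hit["metadata"]["title"] for hit in hits]


def fetch_titles(community: str, size: int = 10) -> list[str]:
    """Fetch record titles for a community; requires network access."""
    with urlopen(community_query_url(community, size), timeout=30) as reply:
        return extract_titles(json.load(reply))


if __name__ == "__main__":
    for title in fetch_titles("eltec", size=5):
        print(title)
```

Separating URL construction and response parsing from the network call makes the two pure functions easy to test without contacting the service.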
Publication Platform for Digital Editions
For digital editions, the interaction between data sets, visualisation, and exploration / filter mechanisms is important. Therefore, the data sets of digital editions are often published not in data repositories alone but also on specific search and visualisation platforms. Digital edition platforms can be found, for example, via Greta Franzini’s catalogue or Patrick Sahle’s catalogue.
The blog post by Marta Błaszczyńska and Bartłomiej Szleszyński discusses issues concerning digital scholarly editions and FAIR in more depth.
Licensing
Each data publication needs a licence statement that ensures transparency for data (and software) re-use scenarios. With a licence, creators grant the right to use their work. Importantly, if there is no right of use, there is no (re-)use.
For CLS data, copyright regulations often play a crucial role. Copyright may be regulated at the national, European, or international level. The OpenAIRE project provides a blog post about how to license research data. For re-using copyright-protected data, Andresen et al. (2023) provide a workflow.