Challenges to Sustainable Data Sharing

Author: Françoise Gouzi

Recent changes in research policy and research funding recognise data sharing as a necessary condition for scientific innovation. Data sharing could also accelerate changes of practice at the infrastructural level. Open data mandates (and data management plans) are increasingly becoming conditions of research funding at the European, national, and institutional levels. They affect the working conditions and practices of all researchers within the broader arts and humanities domain, not only those who identify as digital humanists (see DARIAH Data Policy).

In November 2023, we launched an online survey inviting five researchers working with literary corpora (from early career researchers to established ones, from the CLS INFRA project and beyond) to give us insights into their data sharing practices. Their responses, which relate to both technical and cultural aspects of CLS research, serve in this wiki as case studies on which we base our recommendations for data sharing between researchers and institutions, formulated in section 3. With those recommendations we advocate the values of open science and reproducibility of results as they materialize in the FAIR principles (Wilkinson et al. 2016); we strive to apply those broad-ranging principles to the specific situation of CLS researchers. In this short introduction, we use excerpts from the survey responses, which we publish in their entirety in section 2.

Researchers constantly face numerous, and often conflicting, demands for data availability and reusability, voiced in various research policy recommendations and funding requirements. At the same time, they struggle with many factors that hinder data sharing, which may be legal, cultural, infrastructural, or managerial in nature. As a result, although policy makers increase the pressure to make data openly available, in reality only a small fraction of datasets are reused and very few scholarly publications include references to datasets (Borgman and Groth 2024). Moreover, resources for maintaining data are scarce.

The following situations are symptomatic of the non-reproducibility of research (Desquilbet et al. 2019), which goes hand in hand with halting data sharing:

“I’ve lost my data and my programming code”
“My results have changed”
“My code is not running”
“My new PhD student is not observing the same effects as her predecessor”

When trying to replicate earlier, non-digital work, there is often neither code (but rather a research method) nor data (just the indication of relevant literary works, sometimes not even specific editions). [from survey responses]

Legal limitations are the single biggest inhibitor to my research. Contemporary publishing has very strong IP protections limiting what researchers can do and study. [from survey responses]

Data created or collected in CLS may be subject to copyright in whole or in part. Legal restrictions can affect not only works of art and literature (including specific editions), but also software or databases, i.e. both the material and the tools used in CLS can be subject to restrictions. Importantly, those restrictions affect research practices according to the territoriality principle, i.e. they apply country by country. For certain topics, the EU has created an overarching legal framework, such as the General Data Protection Regulation (GDPR), the Directive on the Legal Protection of Databases, or the Directive on Copyright in the Digital Single Market. These are relevant for text and data mining (TDM) in the research context.

Moreover, a crucial impediment to data sharing and the reproducibility of results is the human factor, i.e. the lack of trained personnel (Borgman and Groth 2024). The human factor affects “infrastructure durability”. For this reason, training a data management workforce is a key challenge for Open Science and the FAIR principles (Borgman et al. 2016).

When we consider the epistemic dimensions of data sharing, we come across hindrances such as the distance (Borgman and Groth 2024) between data creators and data reusers (domain distance, methodological distance, curation distance, spatio-temporal distance, etc.). To address the data sharing problems related to forms of cooperation, Borgman and Bourne propose a new form of governance: “building the village”, i.e. moving towards a community built from the outset around data sharing and the notion of mutual responsibility (Borgman and Bourne 2022). We might start establishing our village by formulating explicit policies and by documenting our data collection and modelling choices:

More and more journals have policies on code and data sharing, e.g. Cultural Analytics but also our own journal, the Journal for Computational Literary Studies. [from survey responses]

Making explicit and documenting all data modeling choices. Available ontologies don’t cover all aspects of research data in CLS, e.g. there is a big gap related to narrative and fictional elements. It’s important to start using more frequently existing ontologies: CIDOC-CRM, NLP Interchange Format, OntoLex, etc. [from survey responses]

Looking back on our experiences with the corpora of the Slovak novel and the haiku, recorded in CLS INFRA deliverable 5.2 (Case studies in data preparation and sharing), we can also highlight two further impediments to data sharing and reuse: the scarcity of resources, and insufficient metadata that, for example, does not allow a literary genre or style to be identified.
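One way to make such metadata gaps visible before a corpus is shared is a simple completeness check. The following is only a minimal sketch under stated assumptions: the list of required fields, the function names, and the two sample records are hypothetical illustrations, not a CLS INFRA schema or data taken from the deliverable.

```python
from __future__ import annotations

# Hypothetical minimal set of descriptive fields (an assumption for
# illustration, not an established CLS metadata standard).
REQUIRED_FIELDS = ("title", "author", "year", "language", "genre", "licence")


def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or empty in one record."""
    gaps = []
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        if value is None or not str(value).strip():
            gaps.append(name)
    return gaps


def audit_corpus(records: list[dict]) -> dict[str, list[str]]:
    """Map the identifier of each incomplete record to its missing fields."""
    report = {}
    for index, record in enumerate(records):
        gaps = missing_fields(record)
        if gaps:
            # Fall back to a positional identifier if the record has no "id".
            report[record.get("id", f"record-{index}")] = gaps
    return report


# Made-up sample records, used only to demonstrate the check.
sample_records = [
    {"id": "text-001", "title": "Example novel A", "author": "Author A",
     "year": 1901, "language": "sk", "genre": "novel", "licence": "CC0"},
    {"id": "text-002", "title": "Example poem B", "author": "Author B",
     "year": 1905, "language": "sk", "genre": "", "licence": None},
]

if __name__ == "__main__":
    for identifier, gaps in audit_corpus(sample_records).items():
        print(f"{identifier}: missing {', '.join(gaps)}")
```

Run on the sample records, the check reports that the second record lacks a genre and a licence, which is the kind of gap that prevents a reuser from selecting texts by genre. In practice the required fields would have to be agreed upon by the community that shares the corpus.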

Researchers should be able to replicate analyses using the same datasets in order to validate and build upon existing findings. Efficient data management promotes data sharing and collaboration within the research community. Furthermore, proper data management practices ensure the accuracy and reliability of research findings and the long-term accessibility and usability of data. This is crucial for preserving research outputs and data for future generations, so as to avoid losing valuable information.

However, it is crucial to remember that creativity is the real goal of research, for which replicability and accessibility are necessary but not sufficient conditions:

…replication in the sciences is an idea rather than an actual scholarly routine. An idea, because the code and the datasets should be made publicly available in order to make a study in question entirely replicable. In practice, however, it is much more beneficial for the sake of advancing human knowledge if a study in question opens a new perspective, and paves a way to replicate the original study in slightly different conditions. It is more beneficial, in my opinion, to extend the original study by new languages, new genres, or new literary periods. (Let alone the fact that an ideal replication study would be very difficult to publish). [from survey responses]

References

Borgman, Christine L., and Philip E. Bourne. 2022. “Why It Takes a Village to Manage and Share Data.” Harvard Data Science Review, July. https://doi.org/10.1162/99608f92.42eec111.

Borgman, Christine L., Peter T. Darch, Ashley E. Sands, and Milena S. Golshan. 2016. “The Durability and Fragility of Knowledge Infrastructures: Lessons Learned from Astronomy.” https://doi.org/10.48550/ARXIV.1611.00055.

Borgman, Christine L., and Paul T. Groth. 2024. “From Data Creator to Data Reuser: Distance Matters.” https://doi.org/10.48550/ARXIV.2402.07926.

Desquilbet, Loic, Sabrina Granger, Boris Hejblum, Arnaud Legrand, Pascal Pernot, Nicolas P. Rougier, Elisa de Castro Guerra, et al. 2019. Vers Une Recherche Reproductible. Edited by Unité régionale de formation à l’information scientifique et technique de Bordeaux. https://hal.science/hal-02144142.

Wilkinson, Mark D., Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, et al. 2016. “The FAIR Guiding Principles for Scientific Data Management and Stewardship.” Scientific Data 3 (1): 160018. https://doi.org/10.1038/sdata.2016.18.