Maciej Eder

1. Could you describe your research project, which required gathering and disseminating a large amount of data?

Our research projects at the Institute of Polish Language, Polish Academy of Sciences, focus on applying computational techniques to analyze texts. While these texts can generally be considered literary, non-literary sources (or even materials that are not textual in the strictest sense) also fall within our area of interest. The goal of most of these projects is to uncover patterns, themes, and similarities across texts using methods such as clustering, topic modeling, and sentiment analysis. The size of the datasets varies from project to project. Some of the projects conducted by our team include:

2. How do you discover data that is relevant for your research and which factors help you to assess its quality and trustworthiness?

Discovering relevant data involves a combination of approaches: literature review, searching online databases and digital libraries, and collaboration with other researchers or institutions that specialise in digitising and archiving literary materials. Factors that help assess data quality and trustworthiness include:

  1. accuracy (controlling the number of errors of different types)

  2. completeness (obvious, but not always easy to meet)

  3. metadata (rarely available, or at most at an elementary level)

  4. source reputation

  5. consistency (of paramount importance when a larger corpus is acquired from several sources)

  6. validation (ideal data)

  7. permissions & copyrights (a problematic topic, of course).

By considering the above factors, one can make informed decisions about which data sources to utilise for further research. From the above list, factor 4 (source reputation) is certainly decisive. In our team, we rely on established collections of literary texts, such as the Gutenberg Project, ELTeC, DraCor, Bibliotheca Augustana, The Latin Library, the Perseus Project, and similar initiatives.
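Some of the factors listed above, notably completeness and consistency, can be checked at least semi-automatically before any analysis starts. The snippet below is a minimal sketch of such a sanity check, assuming a hypothetical directory of plain-text files; the word-count threshold is illustrative, not an actual setting used in our projects.

    from pathlib import Path

    CORPUS_DIR = Path("corpus")  # hypothetical location of the plain-text corpus

    def basic_quality_report(corpus_dir: Path) -> None:
        """Rudimentary completeness and consistency checks for a text corpus."""
        files = sorted(corpus_dir.glob("*.txt"))
        print(f"{len(files)} text files found")
        for f in files:
            text = f.read_text(encoding="utf-8", errors="replace")
            n_words = len(text.split())
            if n_words < 1000:  # suspiciously short: possibly truncated or incomplete
                print(f"  WARNING: {f.name} contains only {n_words} words")
            if "\ufffd" in text:  # replacement characters signal encoding problems
                print(f"  WARNING: {f.name} contains undecodable characters")

    basic_quality_report(CORPUS_DIR)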

3. What are the scholarly workflows that turn source material into data (extraction, transformation, unifying in a repository, etc.)? How do you develop a shared understanding about data with your collaborators and stakeholders?

Scholarly workflows for turning source material into data typically involve several stages:

  1. data acquisition, as discussed above

  2. data extraction in a broad sense (including digitizing from paper copies or extracting raw text data from different formats such as PDFs, HTML, etc.)

  3. data preprocessing, which usually means cleaning obvious errors, removing noise and formatting inconsistencies, and improving metadata

  4. normalization, understood as standardising and/or modernising spellings, to ensure consistency across the dataset

  5. structural annotation if necessary

  6. grammatical annotation, i.e. applying a set of NLP procedures in order to add information about grammatical categories (parts of speech), named entities, etc.

  7. repository integration: usually putting a dataset on GitHub and/or our Institute’s GitLab instance (a minimal sketch of steps 3, 4, and 6 follows this list).
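The following is a minimal sketch of what steps 3, 4, and 6 can look like in practice. It assumes spaCy as one possible NLP toolkit and an invented normalisation table; the model name and replacement rules are illustrative assumptions, not the exact tooling used in our projects.

    import re

    import spacy  # one possible tagger/lemmatizer; requires a downloaded model

    # Illustrative spelling-normalisation table (step 4).
    NORMALISATION = {"olde": "old", "shew": "show"}

    def preprocess(raw: str) -> str:
        """Step 3: clean obvious noise and formatting inconsistencies."""
        text = raw.replace("\u00ad", "")   # drop soft hyphens left over from OCR
        text = re.sub(r"\s+", " ", text)   # collapse inconsistent whitespace
        return text.strip()

    def normalise(text: str) -> str:
        """Step 4: standardise / modernise spellings for consistency."""
        return " ".join(NORMALISATION.get(w.lower(), w) for w in text.split())

    def annotate(text: str, model: str = "en_core_web_sm"):
        """Step 6: add parts of speech and lemmas with an NLP pipeline."""
        nlp = spacy.load(model)
        doc = nlp(text)
        return [(token.text, token.lemma_, token.pos_) for token in doc]

    if __name__ == "__main__":
        raw = "It is a truth  universally acknowledged..."
        print(annotate(normalise(preprocess(raw))))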

To develop a shared understanding of the data with collaborators and stakeholders, communication and documentation are essential. These include:

By fostering open communication and collaboration, researchers can develop a shared understanding of the data and ensure alignment among stakeholders throughout the research process.

4. What is the effect of legal or regulatory limitations on your research design and execution, as well as on your data sharing procedures? What were your relations with data providers and/or copyright holders?

Legal and regulatory limitations can significantly impact research design, execution, and data sharing procedures in computational text analysis of literary materials. In particular, researchers in Computational Literary Studies would be very much interested in analyzing massive amounts of contemporary literature. For copyright reasons, however, this is very rarely possible, which puts computational approaches to literature in a less comfortable position than traditional literary studies. Let’s face it: contemporary literature is and has always been the most vibrant topic in literary studies. The situation of ancient texts (say, Latin and Greek), as well as Early Modern ones, is similar in a way: in order to acquire a reasonable dataset, a researcher has to rely on outdated 19th-century editions that are freely accessible in the Gutenberg Project and similar repositories. Using any state-of-the-art critical edition of an ancient or medieval text might be virtually impossible for copyright reasons. Exceptions include The Riddle of Literary Quality project conducted at the Huygens Institute (KNAW) in Amsterdam. With special permission from the publishers, the project team could use the original full-size contemporary novels, provided that all the books were accessed remotely via API queries. Consequently, the researchers never saw the books themselves, but they were allowed to conduct (remotely) any analysis they wished.

5. Do you release your datasets together with your research findings? If yes, in what formats / standards and which repositories? What kind of metadata is used?

Yes, releasing datasets alongside research findings has become a common practice in computational text analysis research, and our team is no exception. The choice of formats, standards, and repositories for data sharing depends on the specific requirements of the research community and the preferences of the researchers. Obviously, however, TEI is the standard (or format) that first comes to mind. In several other applications raw text files can be processed directly, so the actual format is plain TXT. Where grammatical annotation is concerned, we comply either with TEI or with the so-called vertical format, which combines simple markup with text arranged in three columns: each row contains a word form (as it appears in the text), its recognized lemma, and a part-of-speech tag (a short example follows below).
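To illustrate, the following sketch writes a few tokens in such a vertical format; the tokens, tagset, file name, and enclosing markup are invented for the example and do not reproduce any particular corpus of ours.

    # One token per row: word form, lemma, and POS tag separated by tabs,
    # optionally wrapped in simple structural markup.
    tokens = [
        ("Pride", "pride", "NOUN"),
        ("and", "and", "CCONJ"),
        ("Prejudice", "prejudice", "NOUN"),
    ]

    with open("austen_pride.vrt", "w", encoding="utf-8") as out:
        out.write('<text author="Austen" title="Pride and Prejudice">\n')
        for form, lemma, pos in tokens:
            out.write(f"{form}\t{lemma}\t{pos}\n")
        out.write("</text>\n")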

Metadata accompanying the datasets varies and depends on the purpose for which the dataset was created. The bare minimum is to provide, for each text in a corpus, its author and title, and preferably also the publication year. In a more ideal scenario, which we try to comply with but rarely have sufficient resources to do so efficiently, the metadata includes information such as:

By providing comprehensive metadata, researchers enable others to understand, evaluate, and effectively reuse the datasets for further analysis or validation. Adhering to community standards and best practices for data sharing facilitates collaboration, reproducibility, and the advancement of knowledge in computational text analysis and digital humanities.
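As a simple illustration, the bare-minimum metadata described above can be kept in a small table distributed alongside the corpus; the file name, column names, and records below are hypothetical.

    import csv

    # Hypothetical bare-minimum metadata: author, title, publication year.
    records = [
        {"author": "Austen, Jane", "title": "Pride and Prejudice", "year": 1813},
        {"author": "Dickens, Charles", "title": "Bleak House", "year": 1853},
    ]

    with open("metadata.csv", "w", encoding="utf-8", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=["author", "title", "year"])
        writer.writeheader()
        writer.writerows(records)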

6. How can you facilitate mutual understanding of each other’s data within your discipline? Do you have shared vocabularies, ontologies or metadata standards available?

Facilitating mutual understanding of data within the discipline of computational text analysis and digital humanities involves several strategies, including the use of shared vocabularies, ontologies, and metadata standards. In practice, however, our team has rather little to share in terms of, say, metadata standards. Rather than sharing our own solutions, we put as much effort as possible into complying with already existing standards and shared vocabularies. An exception rather than a rule is our way of embedding metadata about literary works in the filenames of the respective text files. Several years ago, for the purpose of conducting methodological experiments in authorship attribution, we adopted a simple convention of encoding metadata in filenames, in order to avoid maintaining any additional metadata tables; e.g. the file austen_pride.txt is expected to contain the novel Pride and Prejudice by Jane Austen. This convention was first used in the stylometric software stylo and has gained reasonable acceptance in the area of CLS (or even more broadly).
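As a small illustration of this convention, elementary metadata can be recovered directly from such filenames; splitting on the first underscore and the resulting field names are an assumption about how the labels are composed, not part of any formal specification.

    from pathlib import Path

    def parse_filename(path: str) -> dict:
        """Recover elementary metadata from a filename such as 'austen_pride.txt'
        (author label, then work label, separated by an underscore)."""
        author, _, work = Path(path).stem.partition("_")
        return {"author": author, "work": work}

    print(parse_filename("austen_pride.txt"))
    # {'author': 'austen', 'work': 'pride'}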

7. Have you ever approached a research support agent (such as a data steward, librarian, or archivist) and requested or received their assistance? Could you name them? How can cultural heritage professionals (archivists, librarians, museologists, etc.) support your work?

It is easy to say that collaboration with cultural heritage professionals enables researchers to leverage their expertise and resources to enhance the quality, accessibility, and impact of their work in computational text analysis and digital humanities. It is more difficult to achieve, though. A direct example of such collaboration is our project on language change in Polish, for which we had to acquire as large a corpus as possible of texts covering a few centuries. A big help was offered by the foundation Wolne Lektury: not only were we allowed to download their content, but they also consulted with us about which 19th-century texts should be digitized first. A similar attitude towards digitizing resources in consultation with potential users and stakeholders is shown by the librarians from the Polish National Library working on the Polona collection.

8. Have you ever used tools supporting the principles of Open Science, FAIR or CARE in your Data Management Plan, such as automated FAIR metrics evaluation tools (FAIR enough, the FAIR evaluator, data management platform, or others)?

Not really. While there is awareness of their existence in our team, we rarely (if at all) incorporate them into our research workflows. In the context of the projects conducted by our team, the FAIR principles are either relatively easy to keep an eye on (when the source materials are not copyrighted) or not applicable at all (when the materials cannot be turned into FAIR data); in either case, additional tooling is not really helpful. The tools that do indirectly contribute to keeping our datasets FAIR are the open-source technologies we have adopted at the Institute. These include the version control system Git, which enables researchers to manage changes to their datasets and track the evolution of data over time; we routinely deposit our code and datasets either on GitLab or on GitHub. By doing this, we hope to promote transparency, reproducibility, and collaboration, which certainly aligns with the principles of FAIR, Open Science, and CARE (Collective Benefit).

9. Have you ever found it difficult to replicate findings obtained by another researcher or to reuse the data that served as the foundation for those conclusions? What was the main reason behind irreproducibility?

It has happened many times; more often than not, in fact. I could even say that replication is routinely an issue, for different reasons, ranging from software dependencies (e.g. some Python libraries can no longer be installed in the required older versions) to code that contains very specific settings (e.g. absolute paths to the data, which by definition look different on different machines). Some common reasons behind irreproducibility include:

Addressing these challenges requires concerted efforts to promote transparency, rigor, and openness in computational text analysis and digital humanities research. Emphasizing comprehensive documentation, data validation, open access to data and code, methodological transparency, and community standards can enhance reproducibility and facilitate the reuse of research findings by other scholars.
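Two simple habits address the dependency and path problems mentioned above: pinning library versions so that an environment can be re-created later, and resolving data paths relative to the analysis script rather than to one particular machine. A minimal sketch, with the file layout and version numbers chosen purely for illustration:

    # requirements.txt (illustrative): pin exact versions, e.g.
    #   pandas==2.2.2
    #   scikit-learn==1.5.0

    from pathlib import Path

    # Resolve the data directory relative to this script, never as an
    # absolute path that exists only on the original author's machine.
    DATA_DIR = Path(__file__).resolve().parent / "data"

    def load_corpus(data_dir: Path = DATA_DIR) -> dict:
        """Read all plain-text files from the (relative) data directory."""
        return {p.name: p.read_text(encoding="utf-8")
                for p in sorted(data_dir.glob("*.txt"))}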

10. Are you aware of anyone who has attempted to replicate your findings? Was the endeavor a success?

Well, yes and no. Yes, because there have been a few studies aimed at using my code and/or my dataset to repeat my analytical steps. And no, because these experiments are not replication studies in the strict sense. The usual practice is to take someone else’s code and apply it to an extended dataset, in order to confirm the validity of the original findings. Or, sometimes, a different programming language is used to re-implement an original analytical procedure (e.g. preprocessing) in order to use it in a more general setup.

As far as my experience suggests, replication in the sciences is an idea rather than an actual scholarly routine. An idea, because the code and the datasets should be made publicly available in order to make the study in question entirely replicable. In practice, however, it is much more beneficial for the advancement of human knowledge if a study opens a new perspective and paves the way to replicating the original study under slightly different conditions. It is more beneficial, in my opinion, to extend the original study with new languages, new genres, or new literary periods. (Let alone the fact that an ideal replication study would be very difficult to publish.)

11. According to you, has the institutional mindset around sharing data evolved over time? In particular, how did you find the attitudes of publishers, libraries and policy-makers?

Yes, it has certainly evolved; my own observations over the last 20 years or so confirm it. The institutional mindset around sharing data has changed over time, driven by various factors such as technological advancements, changes in scholarly communication practices, and shifting cultural norms regarding openness and transparency in research. Here is how the attitudes of publishers, libraries, and policy-makers have evolved:

Overall, there has been a noticeable shift towards greater support for data sharing and open science principles among publishers, libraries, and policy-makers. While challenges remain, such as ensuring data privacy and addressing concerns about intellectual property rights, the growing emphasis on transparency, collaboration, and knowledge exchange is driving positive change in research culture and practices.