Maciej Eder
1. Could you describe your research project, which required gathering and disseminating a large amount of data?
Our research projects at the Institute for Polish Language of the Polish Academy of Sciences focus on applying computational techniques to analyze texts. While these texts can generally be considered literary, non-literary sources (or even non-textual ones, in the strictest sense) also fall into our area of interest. The goal of most of these projects is to uncover patterns, themes, and similarities across texts using methods such as clustering, topic modeling, and sentiment analysis. The actual size of the datasets varies. Some of the projects conducted by our team include:
a series of studies on language change that required a few relatively large diachronic corpora of Polish and English
a PhD project on multimodal stylometry, in which large numbers of TV series are downloaded and statistically analyzed
a PhD project on Latin poetry that relies on a relatively small but very well-curated collection of Latin poems
a project on the automatic detection of direct speech in literary corpora in several languages, relying mostly on ELTeC, a high-quality corpus containing 1,200+ novels in several languages
a series of methodological studies aimed at testing different stylometric methods, for which a series of comparable corpora in a few languages – each containing a few dozen full-sized novels out of copyright – had to be collected.
2. How do you discover data that is relevant for your research and which factors help you to assess its quality and trustworthiness?
Discovering relevant data involves a combination of methods such as literature review, online databases, digital libraries, and collaborations with other researchers or institutions that specialise in digitising and archiving literary materials. Factors that help assess data quality and trustworthiness include:
accuracy (controlling the number of errors of different types)
completeness (obvious, but not always easy to meet)
metadata (provided rather rarely, or at most at an elementary level)
source reputation
consistency (of paramount importance when a larger corpus is acquired from several sources)
validation (ideal data)
permissions & copyrights (a problematic topic, of course).
By considering the above factors, one can make informed decisions about which data sources to use for further research. From the above list, source reputation is certainly the decisive factor. In our team, we rely on established collections of literary texts, such as Project Gutenberg, ELTeC, DraCor, Bibliotheca Augustana, The Latin Library, the Perseus Project, and similar initiatives.
3. What are the scholarly workflows that turn source material into data (extraction, transformation, unifying in a repository, etc.)? How do you develop a shared understanding about data with your collaborators and stakeholders?
Scholarly workflows for turning source material into data typically involve several stages:
data acquisition, as discussed above
data extraction in a broad sense (including digitizing from paper copies or extracting raw text data from different formats such as PDFs, HTML, etc.)
data preprocessing, which usually means cleaning obvious errors, removing noise and formatting inconsistencies, improving metadata
normalization, by which one can understand standardising and/or modernising spellings, to ensure consistency across the dataset
structural annotation if necessary
grammatical annotation, i.e. applying a set of NLP procedures in order to add information about grammatical categories (parts of speech), named entities, and the like (a minimal sketch combining this and the preprocessing step follows the list)
repository integration: usually putting the dataset on GitHub and/or our Institute’s GitLab instance.
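To make the preprocessing and annotation steps above more tangible, here is a minimal sketch in Python of what such a pipeline might look like. It assumes spaCy with its small English model is installed, and the input file name is purely hypothetical; our actual pipelines differ between languages and projects, so this is an illustration rather than the code we run:

    import re
    import spacy

    # assumes the small English model has been installed beforehand
    nlp = spacy.load("en_core_web_sm")

    def clean(raw_text: str) -> str:
        # basic preprocessing: undo hyphenation at line breaks, collapse whitespace
        text = raw_text.replace("-\n", "")
        return re.sub(r"\s+", " ", text).strip()

    def to_vertical(text: str) -> str:
        # grammatical annotation: one token per row (word form, lemma, POS tag)
        doc = nlp(text)
        return "\n".join(f"{t.text}\t{t.lemma_}\t{t.pos_}" for t in doc if not t.is_space)

    with open("austen_pride.txt", encoding="utf-8") as infile:  # hypothetical input file
        vertical_text = to_vertical(clean(infile.read()))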
To develop a shared understanding about the data with collaborators and stakeholders, communication and documentation are essential. These include:
Collaborative meetings: regular meetings or discussions (in-person or hybrid) where team members can share their progress, insights, and challenges related to the data.
Documentation: creating detailed documentation that outlines the data collection methods, preprocessing steps, metadata format, and any assumptions or limitations associated with the dataset. These are all of paramount importance, but in practice we adopted a rule of thumb that at least some basic documentation should always be provided.
Data samples: ideally, some data excerpts should be made available even for copyrighted materials, in order to illustrate the characteristics and potential insights of the dataset. In practice, since sharing the full-text datasets is restricted, we adopted a routine of sharing derived data on GitHub, such as word frequencies instead of original texts.
Feedback: establishing a feedback mechanism through which collaborators can provide input, suggestions, or corrections to improve the quality and usability of the data. In our case, such a feedback mechanism is mostly informal, via casual chats between team members.
Training and workshops: Providing training sessions or workshops to familiarize collaborators with the data, analytical tools, and methods used in the research project.
By fostering open communication and collaboration, researchers can develop a shared understanding of the data and ensure alignment among stakeholders throughout the research process.
4. What is the effect of legal or regulatory limitations on your research design and execution, as well as on your data sharing procedures? What were your relations with data providers and/or copyright holders?
Legal and regulatory limitations can significantly impact research design, execution, and data sharing procedures in the computational analysis of literary materials. In particular, researchers in Computational Literary Studies would be very much interested in analyzing massive amounts of contemporary literature. For copyright reasons, however, this is very rarely possible, which puts computational approaches to literature in a less comfortable position than traditional literary studies. Let’s face it: contemporary literature is, and has always been, the most vibrant topic in literary studies. The situation of ancient texts (say, Latin and Greek), as well as early modern ones, is similar, however. In order to acquire a reasonable dataset, a researcher has to rely on outdated 19th-century editions that are freely accessible on Project Gutenberg and similar repositories. Using any state-of-the-art critical edition of an ancient or medieval text might be virtually impossible for copyright reasons. Exceptions include The Riddle of Literary Quality project conducted at the Huygens Institute (KNAW) in Amsterdam: with special permission from the publishers, the project team could use original, full-size contemporary novels, provided that all the books were accessed remotely via API queries. Consequently, the researchers never saw the books themselves, but they were allowed to conduct (remotely) any analysis they wished.
5. Do you release your datasets together with your research findings? If yes, in what formats / standards and which repositories? What kind of metadata is used?
Yes, releasing datasets alongside research findings has become common practice in computational text analysis, and our team is no exception. The choice of formats, standards, and repositories for data sharing depends on the specific requirements of the research community and the preferences of the researchers. Obviously, however, TEI is the standard (or format) that first comes to mind. In several other applications raw text files can be processed directly, so the actual format is plain TXT. Where grammatical annotation is concerned, we either comply with TEI or use the so-called vertical format, a combination of simple markup and text laid out in three columns: each row contains a word form (as it appears in the text), its recognized lemma, and a part-of-speech tag.
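Purely for illustration, a few rows of such a vertical file might look like the following (the actual tags depend on the tagger and tagset used):

    She        she        PRON
    walked     walk       VERB
    slowly     slowly     ADV
    .          .          PUNCT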
Metadata accompanying the datasets varies and depends on the purpose for which the dataset was created. The bare minimum is to provide, for each text in a corpus, its author and title, and preferably also a publication year. In a more ideal scenario, which we try to follow but rarely have sufficient resources to do so efficiently, it includes information such as the following (a hypothetical example record is sketched after the list):
Title
Author(s) or creator(s)
Date of creation or publication
Gender of the author
Genre of a literary work
License or terms of use
Data format and encoding
Source(s) of the data
Any preprocessing steps applied to the data
Relevant identifiers (e.g. DOI)
Keywords
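Purely as an illustration of how these fields might be recorded, a minimal metadata entry could look as follows; the field names and values are hypothetical and do not represent a standard we prescribe:

    # a hypothetical metadata record; field names are illustrative only
    record = {
        "title": "Pride and Prejudice",
        "author": "Jane Austen",
        "year": 1813,
        "author_gender": "F",
        "genre": "novel",
        "license": "public domain",
        "format": "plain text, UTF-8",
        "source": "Project Gutenberg",
        "preprocessing": "boilerplate removed, spelling left unmodernised",
        "keywords": ["English novel", "19th century"],
    }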
By providing comprehensive metadata, researchers enable others to understand, evaluate, and effectively reuse the datasets for further analysis or validation. Adhering to community standards and best practices for data sharing facilitates collaboration, reproducibility, and the advancement of knowledge in computational text analysis and digital humanities.
6. How can you facilitate mutual understanding of each other’s data within your discipline? Do you have shared vocabularies, ontologies or metadata standards available?
Facilitating mutual understanding of data within the discipline of computational text analysis and digital humanities involves several strategies, including the use of shared vocabularies, ontologies, and metadata standards. In practice, however, our team has rather little to share in terms of, say, metadata standards. Rather than sharing our own solutions, we put as much effort as possible into complying with already existing standards and shared vocabularies. An exception rather than a rule is our way of embedding metadata about literary works in the filenames of the respective text files. Several years ago, for the purpose of conducting methodological experiments in authorship attribution, we adopted a simple convention of using filenames to avoid any additional metadata tables: e.g. the file austen_pride.txt is assumed to contain the novel Pride and Prejudice by Jane Austen. This convention was first used in the stylometric software stylo and has gained reasonable acceptance in the area of CLS studies (or even more broadly).
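A hypothetical helper (not part of stylo itself) illustrates how such filename-encoded metadata can be recovered programmatically:

    from pathlib import Path

    def metadata_from_filename(path: str) -> dict:
        # "austen_pride.txt" -> {"author": "austen", "title": "pride"}
        author, _, title = Path(path).stem.partition("_")
        return {"author": author, "title": title}

    metadata_from_filename("austen_pride.txt")  # {'author': 'austen', 'title': 'pride'}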
7. Have you ever approached a research support agent (such as a data steward, librarian, or archivist) and requested or received their assistance? Could you name them? How can cultural heritage professionals (archivists, librarians, museologists, etc.) support your work?
It is easy to say that collaboration with cultural heritage professionals enables researchers to leverage their expertise and resources to enhance the quality, accessibility, and impact of their work in computational text analysis and digital humanities. It’s more difficult to achieve, though. A direct example of such collaboration is our project on language change in Polish, for which we had to acquire as large a corpus of texts as possible, covering a few centuries. A big help was offered by the Wolne Lektury foundation: not only were we allowed to download their content, but they also consulted with us on which 19th-century texts should be digitized first. A similar attitude towards digitizing resources in communication with potential users and stakeholders is shown by the librarians of the Polish National Library working on the Polona collection.
8. Have you ever used tools supporting the principles of Open Science, FAIR or CARE in your Data Management Plan, such as automated FAIR metrics evaluation tools (FAIR enough, the FAIR evaluator, data management platform, or others)?
Not really. While there is awareness of their existence in our team, we rarely (if at all) incorporate them into our research workflows. In the context of the projects conducted by our team, the FAIR principles are relatively easy to keep an eye on (unless the source materials are copyrighted or cannot be turned into FAIR data at all). In such a case, any additional tooling is not really helpful. The tools that do indirectly contribute to keeping our datasets FAIR, however, are the open-source technologies we adopted at the Institute. These include the version control system Git, which enables researchers to manage changes to their datasets and track the evolution of data over time. We routinely deposit our code and datasets either on GitLab or on GitHub. By doing this, we hope to promote transparency, reproducibility, and collaboration, which certainly aligns with the principles of FAIR, Open Science, and CARE (Collective Benefit).
9. Have you ever found it difficult to replicate findings obtained by another researcher or to reuse the data that served as the foundation for those conclusions? What was the main reason behind irreproducibility?
It has happened many times; more often than not, in fact. I could even say that replication is routinely an issue, for various reasons, ranging from so-called software dependencies (e.g. some Python libraries cannot be installed in their older versions) to code that contains very specific settings (e.g. absolute paths to the data, which by definition look different on different machines; a minimal illustration of this problem follows the list). Some common reasons behind irreproducibility include:
Insufficient documentation of data sources, preprocessing steps, analysis procedures, and software dependencies makes it difficult for other researchers to understand and replicate the research workflow. Lack of detailed documentation hinders reproducibility by obscuring key decisions and assumptions made during the research process.
Poor data quality, including errors, inconsistencies, and biases in the dataset, can lead to irreproducibility of findings.
Software dependencies: reliance on specific software tools, libraries, or platforms that are not well documented, maintained, or accessible to other researchers can hinder reproducibility. Changes in software versions, compatibility issues, or lack of access to proprietary tools may prevent researchers from replicating analyses or reusing code.
Limited access to the original data sources or restrictions on data sharing due to privacy concerns, copyright issues, or licensing agreements can impede reproducibility. Without access to the underlying data, researchers cannot independently verify the findings or conduct alternative analyses.
Lack of transparency in reporting analytical methods, parameter settings, and decision criteria makes it challenging for other researchers to reproduce the results. Ambiguities in methodological descriptions, such as vague or incomplete descriptions of algorithms or statistical procedures, undermine reproducibility by introducing uncertainty into the research process.
Last but not least, irreproducibility may also result from human factors such as errors in data processing, coding, or interpretation, as well as differences in researcher expertise, experience, or domain knowledge. Without clear documentation and standardized procedures, variations in human judgment and interpretation can lead to inconsistencies in replication attempts.
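As a small illustration of the absolute-path problem mentioned above, a script can locate its data relative to its own position instead of hard-coding a machine-specific path; a minimal sketch, assuming the corpus sits in a data/ subfolder next to the script:

    from pathlib import Path

    # resolve the corpus directory relative to the script itself,
    # rather than via an absolute path such as /home/someone/projects/corpus
    DATA_DIR = Path(__file__).resolve().parent / "data"
    corpus_files = sorted(DATA_DIR.glob("*.txt"))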
Addressing these challenges requires concerted efforts to promote transparency, rigor, and openness in computational text analysis and digital humanities research. Emphasizing comprehensive documentation, data validation, open access to data and code, methodological transparency, and community standards can enhance reproducibility and facilitate the reuse of research findings by other scholars.
10. Are you aware of anyone who has attempted to replicate your findings? Was the endeavor a success?
Well, yes and no. Yes, because there have been a few studies aimed at using my code and/or my dataset to repeat my analytical steps. And no, because these experiments were not replication studies in the strict sense. The usual practice is to take someone else’s code and apply it to an extended dataset, in order to confirm the validity of the original findings. Or, sometimes, a different programming language is used to re-implement an original analytical procedure (e.g. preprocessing) in order to use it in a more general setup.
As far as my experience suggests, replication in the sciences is an idea rather than an actual scholarly routine. An idea, because the code and the datasets should be made publicly available in order to make the study in question entirely replicable. In practice, however, it is much more beneficial for the sake of advancing human knowledge if a study opens a new perspective and paves the way to replicating the original study under slightly different conditions. It is more beneficial, in my opinion, to extend the original study with new languages, new genres, or new literary periods. (Let alone the fact that an ideal replication study would be very difficult to publish.)
11. According to you, has the institutional mindset around sharing data evolved over time? In particular, how did you find the attitudes of publishers, libraries and policy-makers?
Yes it did evolve, for sure. My own observations over the last 20 years or so definitely confirm that such an evolving attitude cannot be denied. The institutional mindset around sharing data has evolved over time, driven by various factors such as technological advancements, changes in scholarly communication practices, and shifting cultural norms regarding openness and transparency in research. Here’s how attitudes of publishers, libraries, and policy-makers have evolved:
Publishers: many publishers have increasingly embraced open access and data sharing initiatives in response to growing demands for transparency and reproducibility in research. Some publishers now require authors to provide access to underlying data as a condition for publication, and they may offer data repositories or supplemental materials hosting options to facilitate data sharing. Additionally, publishers are implementing policies to promote data citation and encourage proper attribution of shared datasets, recognizing the value of data as a scholarly output.
Libraries play a pivotal role in supporting data sharing and management efforts within the research community. They provide infrastructure, services, and expertise to help researchers organize, preserve, and disseminate their data effectively. Libraries may offer data management planning support, data repository services, data curation assistance, and training programs to promote best practices in data sharing and stewardship. Moreover, libraries advocate for open access principles and collaborate with publishers, researchers, and policymakers to advance open science initiatives and facilitate broader access to research outputs, including data.
Policy-makers at various levels, including funding agencies, government, and international organizations, have recognized the importance of data sharing for accelerating scientific discovery, fostering innovation, and maximizing the societal impact of research investments. As a result, many funding agencies (such as the Polish agency NCN) have implemented data sharing requirements as part of grant conditions, mandating researchers to deposit their data in repositories, adhere to data management plans, and make data publicly accessible. Additionally, policy-makers are promoting open science policies and infrastructure development to enhance data discoverability, accessibility, and reuse across disciplines and borders.
Overall, there has been a noticeable shift towards greater support for data sharing and open science principles among publishers, libraries, and policy-makers. While challenges remain, such as ensuring data privacy and addressing concerns about intellectual property rights, the growing emphasis on transparency, collaboration, and knowledge exchange is driving positive change in research culture and practices.