Christof Schöch

1. Could you describe your research project, which required gathering and disseminating a large amount of data?

As the example in this response, I will use our project “Zeta and Company”, which is about understanding, modeling, implementing, evaluating, and using keyness measures for Computational Literary Studies.
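For context, the measure that gives the project its name, Burrows’ Zeta, is commonly defined as the difference between the proportions of text segments containing a given word in two contrasting corpora. Below is a minimal sketch of that common definition; the function name and the representation of segments as token collections are illustrative assumptions, and the project itself studies several variants of such measures:

```python
def zeta(word, target_segments, comparison_segments):
    """Burrows-style Zeta as commonly defined:
    Zeta(w) = df_T(w)/|T| - df_C(w)/|C|,
    where df_X(w) counts the segments of corpus X that contain w."""
    dp_target = sum(word in seg for seg in target_segments) / len(target_segments)
    dp_comparison = sum(word in seg for seg in comparison_segments) / len(comparison_segments)
    return dp_target - dp_comparison

# Usage: segments are collections of tokens, e.g. sets of word forms.
# zeta("pistolet", crime_segments, highbrow_segments) ranges from -1 to 1;
# values near 1 mark words characteristic of the target corpus.
```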

2. How do you discover data that is relevant for your research and which factors help you to assess its quality and trustworthiness?

In this case, the object of study is the contemporary French high-brow and low-brow novel. We have obtained items from a wide array of sources: commercially available EPUB files and freely available EPUB or HTML files; mostly, however, we digitise pocket editions of these novels ourselves.

3. What are the scholarly workflows that turn source material into data (extraction, transformation, unifying in a repository, etc.)? How do you develop a shared understanding about data with your collaborators and stakeholders?

We scan the books using a document scanner; the books are destroyed in the process, which allows for faster scanning. OCR is then applied, followed by a transformation to simple XML-TEI. A metadata table is managed in a Google spreadsheet for ease of collaborative editing. The text files are stored in a private GitHub repository in various file formats (XML-TEI and plain text, but also non-readable formats; more on that below). The novels are selected with the aim of covering several decades (1950s–1990s) and several sub-genres (crime, sentimental, detective, and general high-brow) equally, also in combination. The gender of the authors plays a role as well, but author gender is highly skewed depending on genre.
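To make the OCR-to-TEI step concrete, here is a minimal sketch of how OCR output might be wrapped in a simple TEI skeleton. It is illustrative only: the function name and the header contents are assumptions, not the project’s actual conversion code:

```python
from xml.sax.saxutils import escape

def plain_text_to_tei(title, author, paragraphs):
    """Wrap OCR output (a list of paragraph strings) in a minimal
    XML-TEI skeleton with the required teiHeader elements."""
    body = "\n".join(f"        <p>{escape(p)}</p>" for p in paragraphs)
    return f"""<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>{escape(title)}</title>
        <author>{escape(author)}</author>
      </titleStmt>
      <publicationStmt><p>Project-internal file, not freely redistributable.</p></publicationStmt>
      <sourceDesc><p>OCR from a scanned pocket edition.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div>
{body}
      </div>
    </body>
  </text>
</TEI>"""
```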

4. What is the effect of legal or regulatory limitations on your research design and execution, as well as on your data sharing procedures? What were your relations with data providers and/or copyright holders?

Virtually all of our texts are subject to copyright and cannot be shared freely. This limits us considerably. As a solution, we create so-called “derived text formats” or “extracted features” from our full texts, for example by randomising the order of the words within segments of a certain size while maintaining the order of the segments, or by replacing a certain proportion of words by their POS tags. Many keyness measures still work quite well with such materials.
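As an illustration of the two transformations just described, here is a minimal sketch. The function names, segment size, and masking proportion are assumptions chosen for the example, not the project’s published specification:

```python
import random

def shuffle_within_segments(tokens, seg_size=1000, seed=42):
    """Randomise word order *within* fixed-size segments while keeping
    the order of the segments themselves: per-segment frequencies
    survive, but the running text cannot be reconstructed."""
    rng = random.Random(seed)
    segments = [tokens[i:i + seg_size] for i in range(0, len(tokens), seg_size)]
    for seg in segments:
        rng.shuffle(seg)  # in-place shuffle inside one segment
    return [tok for seg in segments for tok in seg]

def mask_with_pos(tagged_tokens, proportion=0.5, seed=42):
    """Replace a given proportion of tokens with their POS tag;
    tagged_tokens is a list of (token, pos) pairs."""
    rng = random.Random(seed)
    return [pos if rng.random() < proportion else tok
            for tok, pos in tagged_tokens]
```

Because segment-level word frequencies are preserved, segment-based measures such as Zeta can still be computed on texts derived in this way.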

5. Do you release your datasets together with your research findings? If yes, in what formats / standards and which repositories? What kind of metadata is used?

Yes, we do; that is standard procedure in all of our projects. In this case, however, we can only share the code and the “derived text format” version of the texts.

6. How can you facilitate mutual understanding of each other‘s data within your discipline? Do you have shared vocabularies, ontologies or metadata standards available?

Well, there are some general standards like Dublin Core, CIDOC-CRM, LIDO or (of course) TEI. But for those things that are closely related to literary studies, there are no shared and agreed vocabularies, taxonomies or ontologies, for example for things like narrative perspective, genres and sub-genres, poetic forms, etc. In another project, called “Mining and Modeling Text”, we have addressed this issue to some extent. It is not a major concern in “Zeta and Company”.

7. Have you ever approached a research support agent (such as a data steward, librarian, or archivist) and requested or received their assistance? Could you name them? How can cultural heritage professionals (archivists, librarians, museologists, etc.) support your work?

Not really.

8. Have you ever used tools supporting the principles of Open Science, FAIR or CARE in your Data Management Plan, such as automated FAIR metrics evaluation tools (FAIR Enough, the FAIR Evaluator, a data management platform, or others)?

Not in the “Zeta and Company” project, but we did evaluate our corpus in the “Mining and Modeling Text” project. We describe this (informal) process here (in German): Röttgermann, Julia, and Christof Schöch. 2020. “FAIRe Daten in den Literaturwissenschaften? Das Beispiel ‚Mining and Modeling Text‘ und der französische Roman des 18. Jahrhunderts”. Romanistik-Blog. Blog des Fachinformationsdienstes (blog).

9. Have you ever found it difficult to replicate findings obtained by another researcher or to reuse the data that served as the foundation for those conclusions? What was the main reason behind irreproducibility?

Definitely! I have attempted to replicate earlier research a number of times, and the challenges are different every time. When trying to replicate earlier, non-digital work, there is often neither code (but rather a described research method) nor data (just an indication of the relevant literary works, sometimes not even of specific editions). In other cases, data and/or code can be difficult to locate. I’ve also used replication as a teaching method, because students really need to understand the details of a method, and get into the inner workings of data and code, to make it work. For some documentation of these attempts, see my personal website.

10. Are you aware of anyone who has attempted to replicate your findings? Was the endeavor a success?

In fact, sort of. It was not directly an attempt to replicate our findings, but someone re-used our data for a new investigation of the same research question with a new set of methods. Still, it was very cool and shows the usefulness of publishing data. See the re-using paper.

11. In your view, has the institutional mindset around sharing data evolved over time? In particular, how did you find the attitudes of publishers, libraries and policy-makers?

Attitudes have definitely changed in the right direction, certainly among libraries, policy-makers and some publishers. But the big publishers just want to secure their part of the pie, whether from APC budgets or transformative deals; that is all very negative. Scholars need to take scholarly publishing into their own hands and create the infrastructures and habits for diamond open access.

More and more journals have policies on code and data sharing, e.g. Cultural Analytics but also our own journal, the Journal of Computational Literary Studies. I’m very tempted to carry out some full and strict replications of the articles from [our journal] itself. Really, that should be part of the reviewing process, but that is not yet the case.

When reviewing applications or funding proposals, I also look quite a lot for evidence that people have understood and practised Open Science, but I’m afraid that is not a standard perspective yet.