Federico Pianzola

1. Could you describe your research project, which required gathering and disseminating a large amount of data?

“Graphs and Ontologies for Literary Evolution Models” is a 5-year (2023-2027) research project funded by the European Commission (ERC StG). The goal of the project is to create accurate models of how the cultural traits of fiction, both formal and content-related, spread and combine. The data used are (fan)fiction stories in 5 different languages (English, Spanish, Italian, Korean, and Indonesian) gathered from various online platforms. The methodology mainly combines computational literary studies and cultural evolution theory, with influences from fan studies and information science.

2. How do you discover data that is relevant for your research and which factors help you to assess its quality and trustworthiness?

Discovery: Google Scholar alerts for topics of interest; following researchers on social media; checking data journals and repositories (OSF, CESSDA, Zenodo, Journal of Open Humanities Data).

Assessment: I look at the documentation and try to understand how the data has been collected, annotated, and transformed. I always double-check the methodology to make sure I agree with the modelling decisions made by others.

3. What are the scholarly workflows that turn source material into data (extraction, transformation, unifying in a repository, etc.)? How do you develop a shared understanding about data with your collaborators and stakeholders?

Workflow: discussion of theory and data modelling (e.g. taxonomies, psychological constructs, ontologies); extraction; transformation; enrichment using annotation; organisation based on standards, ontologies, and vocabularies; unifying in a repository.

4. What is the effect of legal or regulatory limitations on your research design and execution, as well as on your data sharing procedures? What were your relations with data providers and/or copyright holders?

Often I can’t share data because it is copyrighted. I’ve recently started sharing derived data and enriched metadata. I always inform data providers and online users when I’m using their data for research. In the case of large-scale collection, I contact the website administrators and/or post an announcement on their websites.

5. Do you release your datasets together with your research findings? If yes, in what formats / standards and which repositories? What kind of metadata is used?

Yes. Always in open formats, e.g. CSV. The metadata covers the content, original source, creator, date, and disciplines of interest.

6. How can you facilitate mutual understanding of each other’s data within your discipline? Do you have shared vocabularies, ontologies or metadata standards available?

Making explicit and documenting all data modelling choices. Available ontologies don’t cover all aspects of research data in CLS, e.g. there is a big gap related to narrative and fictional elements. It’s important to start using existing ontologies more frequently: CIDOC-CRM, NLP Interchange Format, OntoLex, etc.

7. Have you ever approached a research support agent (such as a data steward, librarian, or archivist) and requested or received their assistance? Could you name them? How can cultural heritage professionals (archivists, librarians, museologists, etc.) support your work?

Yes, at my own institution (University of Groningen). They can help by making data collection more comprehensive and accurate, clarifying legal and ethical issues, and supporting documentation and data sharing.

8. Have you ever used tools supporting the principles of Open Science, FAIR or CARE in your Data Management Plan, such as automated FAIR metrics evaluation tools (FAIR enough, the FAIR evaluator, data management platform, or others)?

Yes. OSF, Data Stewardship Wizard.

9. Have you ever found it difficult to replicate findings obtained by another researcher or to reuse the data that served as the foundation for those conclusions? What was the main reason behind irreproducibility?

Yes. Lack of clear documentation and commented code.

10. Are you aware of anyone who has attempted to replicate your findings? Was the endeavor a success?

Yes, I know of an ongoing attempt, but they’re having trouble because we can’t share copyrighted data.

11. According to you, has the institutional mindset around sharing data evolved over time? In particular, how did you find the attitudes of publishers, libraries and policy-makers?

It has changed only slightly in the humanities. Many scholars are still not aware of the importance of sharing data. I’m very happy that the EU is pushing its Open Science agenda hard, forcing others to adapt to it.