Artificial Intelligence – to serve or to be served?
Andrejs Vasiļjevs is a co-founder and Executive Chairman of Tilde, a leading European language technology and localization company. He also serves as a board member of the Multilingual Europe Technology Alliance META-NET and the Big Data Value Association, and has served on the board of the Latvian National Library Foundation. He drives innovation and fosters large scale industry and academia collaboration to advance AI-based solutions for multilingual Europe. Andrejs has received a PhD in computer sciences from the University of Latvia and is an Honorary Doctor of the Academy of Sciences of Latvia.
Despite the hype and mythology surrounding artificial intelligence, it is steadily penetrating almost all areas of human activity. Memory institutions are placed to take a very special role in this age of AI. They can be pioneers in unleashing the power of AI to manage and process huge volumes of diverse information, extract knowledge from raw data, and remove language and accessibility barriers. There are myriad ways that AI can serve memory institutions. At the same time, the relationship with AI should become reciprocal. For AI to become smarter and more powerful, it needs more and more quality data to learn from. It’s the new role of libraries to serve not just humans but also AI by providing its machine learning algorithms with the quality data that AI depends on. Using developments in Latvia as an example, we will discuss the new opportunities opened by AI, the related challenges, and the unique role of memory institutions in driving the advancement of AI.
Running a knowledge graph: Lessons learned from the Japanese Visual Media Graph
Tobias Malmsheimer, Magnus Pfeffer
Tobias Malmsheimer holds an M.Sc. in computer science and is currently responsible for the IT infrastructure in the Japanese Visual Media Graph project.
Magnus Pfeffer is a professor at Stuttgart Media University. His research interests include knowledge organization, linked open data and information retrieval.
Knowledge graphs can be understood as a collection of interlinked descriptions of concepts, entities, relationships and events. In the Digital Humanities, many projects work on integrating existing information from lexical, bibliographical and other sources into domain-specific knowledge graphs. Our project, the Japanese Visual Media Graph (JVMG), is one such example. Like most knowledge graphs, it uses the Resource Description Framework (RDF) as the foundation for its data model.
In the proposed talk we would like to report on lessons learned from building and running the JVMG knowledge graph. We tested several open source triple store and graph database solutions and compared various aspects of their implementation and operation. On the one hand we examined query performance and resource usage. On the other hand we also explored subjective aspects like ease of use and installation and the available feature set.
While the measurements and observations are based on domain specific use cases, we believe they offer some insights into possible bottlenecks and pitfalls and should be a useful starting point for similar endeavors.
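The RDF data model mentioned above reduces to subject-predicate-object triples, and a query is a pattern matched against them. The following is a minimal in-memory sketch of that idea only; the `ex:` identifiers and the helper names are hypothetical and do not reflect the JVMG vocabulary or any of the triple stores the project evaluated.

```python
# Hypothetical identifiers for illustration; not the JVMG vocabulary.
GRAPH = [
    ("ex:work/1", "rdf:type", "ex:Anime"),
    ("ex:work/1", "ex:hasCharacter", "ex:character/7"),
    ("ex:character/7", "rdfs:label", "Example Character"),
    ("ex:work/2", "rdf:type", "ex:Manga"),
]


def match(triples, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        (ts, tp, to)
        for (ts, tp, to) in triples
        if s in (None, ts) and p in (None, tp) and o in (None, to)
    ]


def labels_of_characters(triples, work):
    """Two-step join: work -> its characters -> their labels."""
    labels = []
    for _, _, character in match(triples, s=work, p="ex:hasCharacter"):
        labels += [label for _, _, label in match(triples, s=character, p="rdfs:label")]
    return labels
```

A real triple store answers the same kind of pattern query through SPARQL and indexes the triples, which is where the differences in query performance and resource usage arise.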
While libraries and archives as well as museums have traditionally taken care of analogue research data, there remain challenges in dealing with the rapidly increasing amounts of digital research data.
In digital humanities projects, digital artefacts, such as digitised sources, databases or datasets, project websites, (analytic) algorithms, or (interactive) visualisations, often constitute the primary foundation of research results or are even their primary outcome. One of the main challenges is that digital research data, programming logic, and presentation have often been entangled. As a result, preserving digital research data for long-term access and subsequent re-use often becomes problematic because of the lack of technical standards and the data’s dependence on outdated software.
To address this challenge, a team of librarians and IT researchers at the Max Planck Institute for the History of Science (MPIWG) began to develop DRIH, the digital research infrastructure for the humanities, which is meant to secure the research data generated in the institute’s many projects.
Software Architecture for the Automatization of Subject Indexing
After obtaining his Master of Science in Computer Science, the presenter worked for several years in software development and administration. In 2020 he joined the group “Automatization of Subject Indexing” at ZBW – Leibniz Information Centre for Economics. He is mainly responsible for the design and implementation of the presented software architecture.
Subject indexing, i.e., the enrichment of metadata records for textual resources with descriptors from a controlled vocabulary, is one of the core activities of libraries. Due to the proliferation of digital documents it is not possible to annotate every single document intellectually. Therefore, we need to explore the potentials of automation on every level.
At ZBW, efforts to automate the subject indexing process started as early as 2000. In 2014 the decision was made to start doing the necessary applied research in-house by establishing a PhD position. However, the prototypical machine learning solutions that were developed over the following years had yet to be integrated into productive operations at the library. Therefore, in 2020 an additional position for a software engineer was established and a pilot phase was initiated (planned to last until 2024) with the goal of completing the transfer of our solutions into practice by building a suitable software architecture that allows for real-time subject indexing integrated into existing metadata workflows at ZBW.
Since the beginning of the pilot phase, we have designed and implemented a microservice software architecture which periodically fetches relevant metadata records for our resources (one criterion is that they do not contain intellectual subject indexing) in order to automatically index them with our trained models. These automatically created descriptors are written back to the database that feeds the ZBW search portal if they meet quality criteria which were developed in agreement with our subject indexing experts. Beyond that, these descriptors are provided as suggestions to the subject indexing experts within their subject indexing platform.
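The write-back step described above can be sketched as a simple quality gate. This is an illustration only, not ZBW's implementation: the `Suggestion` type, the confidence score, and the 0.8 threshold are hypothetical stand-ins for the quality criteria agreed with the subject indexing experts, which the abstract does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    """One automatically generated descriptor for a metadata record."""
    record_id: str
    descriptor: str
    score: float  # hypothetical model confidence in [0, 1]


def select_for_write_back(suggestions, threshold=0.8):
    """Keep only suggestions that pass the (assumed) quality threshold.

    All suggestions would still be shown to the indexing experts;
    only the ones returned here would be written to the portal database.
    """
    return [s for s in suggestions if s.score >= threshold]


# Example batch for a single record (invented values).
BATCH = [
    Suggestion("rec-1", "Inflation", 0.93),
    Suggestion("rec-1", "Monetary policy", 0.41),
]
```

With this batch, only "Inflation" would be written back, while both descriptors would remain available as suggestions.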
The system is continuously updated and developed. It runs on a Kubernetes cluster inside the ZBW. We use state-of-the-art software development methods, e.g. separation of functionalities using microservices, close to 100% test coverage, code reviews, and continuous integration. The microservices are implemented in Python using the popular web framework FastAPI, which makes it possible to build OpenAPI-conformant REST APIs. We use a self-hosted GitLab instance for code collaboration and continuous integration.
In this talk, we present the software architecture in some technical detail and its interactions with external systems. We also discuss our implementation decisions and chosen technologies for implementation, deployment, administration, and monitoring.
Automatic Analysis of the Dewey Decimal Numbers with coli-ana
Uma Balakrishnan, Jakob Voß
Uma Balakrishnan holds a master’s degree in Library and Information Science. She leads the project coli-conc and knowledge organisation system (KOS) related projects at the head office of the Common Library Network (VZG). Her interests include KOS, metadata standards and interoperability. She represents the VZG on various expert panels such as the AG-K10plus SE (Expert Group for K10plus Subject Indexing), the FAG-EI (GBV Expert Group for Information and Indexing), the European Dewey User Group and the Regensburg Verbund Classification User Group.
Jakob Voß studied computer science, library science, and philosophy with a PhD degree from the Humboldt University, Berlin. He works in research and development at the headquarters of the Common Library Network (VZG) in Göttingen. His main interests include data modeling, interoperability and knowledge organization.
Dewey Decimal Classification (DDC) is the most widely used classification system internationally. At the dawn of the 21st century, DDC attracted great interest among academic libraries in Europe and was translated into German and several other European languages. The contemporary Dewey Decimal System, with over 51,700 classes, six tables, and a complex number building system with intricate rules and instructions, allows great flexibility and fine granularity in building new numbers for any given concept. An example of such a built number is 641.50902 (medieval cooking), which is built from 641.5 (cooking) in the main schedule and T1--0902 (medieval) in Table 1, and under which documents on medieval cooking are classified. The K10plus union catalogue currently holds over one million unique DDC built numbers. However, the lack of captions for such built numbers makes it difficult for non-Dewey experts to understand their meaning and to use them for information retrieval. Therefore, in 2003, under the VZG project colibri, the sub-project coli-ana was initiated to develop a tool that automatically decomposes and analyzes any given DDC notation down to its smallest non-decomposable components, enriches these with their captions, and thus makes the semantic relationship between the elements apparent. The results of coli-ana improve information retrieval, facilitate their re-use for cataloging, and aid in enhancing knowledge organization systems in general. The service currently offers DDC captions in German and Norwegian. It is planned to include the English version of the DDC as well; however, this would then be subject to licensing. In addition, the analyses are incorporated in the mapping tool Cocoda to assist mappings to and from DDC built numbers.
This talk will give a brief background of coli-ana and the challenges involved in developing such a tool. It will also present the infrastructure, the web service of the tool and show how to use the results of the analysis to improve subject indexing of collections that make better use of the DDC. The catalog enrichment is exemplified with data cleaning and extension of DDC in K10plus union catalogue.
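To make the idea of decomposing a built number concrete, here is a deliberately small sketch based on the medieval cooking example. It is not the coli-ana algorithm: it handles only one base number followed by one Table 1 subdivision, and the two lookup dictionaries are hypothetical toy data (the captions are taken from the example above, and the trailing digits 0902 are treated as a Table 1 entry by assumption).

```python
# Toy lookup data (assumptions for illustration, not real DDC files).
SCHEDULE = {"641.5": "cooking"}
TABLE_1 = {"0902": "medieval"}


def format_ddc(digits):
    """Insert the decimal point after the third digit, e.g. '6415' -> '641.5'."""
    return digits[:3] + ("." + digits[3:] if len(digits) > 3 else "")


def decompose(notation):
    """Split a built number into (notation, caption) components.

    Tries the longest schedule prefix first and accepts it only if the
    remaining digits are empty or a known Table 1 subdivision.
    Returns an empty list if no decomposition is found.
    """
    digits = notation.replace(".", "")
    for cut in range(len(digits), 2, -1):
        base, rest = format_ddc(digits[:cut]), digits[cut:]
        if base in SCHEDULE and (rest == "" or rest in TABLE_1):
            parts = [(base, SCHEDULE[base])]
            if rest:
                parts.append(("T1--" + rest, TABLE_1[rest]))
            return parts
    return []
```

Calling `decompose("641.50902")` yields the base number with its caption followed by the Table 1 component. A real analysis also has to follow the add instructions in the schedules and the other five tables, which this sketch ignores.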
Derivaud: Integrating poorly structured library metadata into a MARC21 union catalogue
Michael Hertig is a system librarian at the University of Lausanne Library, Switzerland, with a focus on authority data and subject indexing.
The Renouvaud library network, based in Lausanne, Switzerland, groups 140 diverse libraries. One of its primary missions is to integrate the many libraries of the canton of Vaud, and for the past 5 years it has integrated about 10 new libraries per year. Candidate libraries are of various types (primary and high school, higher education, research, public, and so on) and often hold data in a format other than the MARC library standard.
Integrating new libraries means creating thousands of new records in the union catalogue every year. Catalogers at the integrated libraries cannot catalog so many records during the few months of the integration period. To accomplish that task, Renouvaud has used a copy-cataloging tool named Derivaud since 2019. This tool analyses the metadata of candidate libraries and retrieves the corresponding MARC21 records from various open repositories such as the Renouvaud catalogue itself, the catalogue of the Swiss Library Service Platform or the catalogue of the Bibliothèque nationale de France (with native UNIMARC records converted into MARC21). Derivaud compares the source metadata with data in the remote repositories and matches them automatically. Candidate records whose matching score is not high enough are then processed by librarians in a dedicated interface where they can select the correct candidate record.
In this presentation, we will introduce the metadata integration workflow as well as the mechanics of the Derivaud tool.
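The score-then-review workflow can be illustrated with a small sketch. The abstract does not describe how Derivaud computes its matching score, so everything here is an assumption: the field names, the 0.7/0.3 weights, the 0.9 threshold, and the use of the standard library's `difflib.SequenceMatcher` as the string similarity measure.

```python
from difflib import SequenceMatcher

# Assumed fields and weights; not taken from Derivaud.
WEIGHTS = (("title", 0.7), ("author", 0.3))


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_score(source, candidate):
    """Weighted similarity between a source record and a candidate record."""
    return sum(
        weight * similarity(source.get(field, ""), candidate.get(field, ""))
        for field, weight in WEIGHTS
    )


def triage(source, candidates, threshold=0.9):
    """Return (best_candidate, score, needs_review).

    needs_review is True when no candidate reaches the threshold,
    i.e. a librarian would pick the correct record manually.
    """
    if not candidates:
        return None, 0.0, True
    best = max(candidates, key=lambda c: match_score(source, c))
    score = match_score(source, best)
    return best, score, score < threshold
```

A source record with a near-identical candidate is matched automatically, while a record whose best candidate scores below the threshold is flagged for the manual interface.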
National libraries are increasingly stepping outside their locally used ILS and steering towards new technologies and possibilities that weren’t feasible before: bridging silos, multilingual support, alternatives to traditional authority control, and so on. Although there is no single right or prevailing answer, the direction is set: the semantic web and linked data are influencing how bibliographic and authority data are created, shared, and made available to potential users. In this paper, I will focus on modeling, mapping, and creating workflows for moving the bibliographic and authority data of the National Library of Latvia to Wikidata as a major hub of the semantic web. The main focus will be on modeling and fully integrating library authority files (persons, organizations, works, locations, etc.) into Wikidata while keeping data loss to a minimum. In this way, Wikidata will be used not just as a hub for institutional identifiers but as a universal and collaborative knowledge base where libraries can truly share and disseminate their knowledge, connect it, and use it to describe and query their own resources.
TEXT ANALYSIS & OCR
An Integrated Text Analysis Platform for all levels of Technical Expertise
Pascal Belouin is a Digital Humanities specialist from the Max Planck Institute for the History of Science, where he supports historians of science in applying digital methods to their research and explores various DH research topics such as knowledge modelling, distant reading, network analysis, and data visualisation. He is also interested in finding novel infrastructural solutions to the problem of long-term research data preservation through the use of decentralised storage solutions such as IPFS.
This talk presents the design principles that guided the development of a Python-based web application and NLP pipeline that aims to facilitate the exploration of the numerous multilingual digital text corpora currently made available by libraries and research institutions.
This platform allows the use of an array of digital humanities methods such as fuzzy search, sub-corpus creation, named entity extraction, topic modelling, tf-idf, or co-occurrence networks, and also features an exhaustive and extensible natural language processing pipeline (from annotation to model training) based on spaCy.
The talk then demonstrates the platform and details the design choices made during its development to make sure it is adaptable to all levels of technological savvy and able to cater to a large range of multilingual text corpora. It concludes by briefly outlining the various ways in which the platform could evolve going forward.
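Of the methods listed above, tf-idf is the easiest to show in a few lines. The sketch below uses the plain variant (relative term frequency times log(N/df)); the abstract does not say which weighting the platform applies, so this choice and the toy documents are assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute tf-idf per document.

    docs is a list of token lists; the result is a list of dicts that map
    each term of a document to tf * log(N / df), where tf is the term's
    relative frequency in the document, N the number of documents and df
    the number of documents containing the term.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append(
            {
                term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return scores


# Toy corpus of pre-tokenised documents (invented).
DOCS = [
    ["the", "library", "catalogue"],
    ["the", "museum", "archive"],
    ["the", "library", "archive"],
]
```

A term that occurs in every document (here "the") gets a score of zero, while a term that occurs in only one document ("catalogue") scores highest within it.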
Finetune your OCR! Improving automated text recognition for early printed works by finetuning existing Tesseract models
Thomas Schmidt, Jan Kamlah
Thomas Schmidt: since 2021 project coordinator “OCR-D: Workflow for work-specific training based on generic models with OCR-D as well as ground truth enhancement”, funded by DFG, Mannheim University Library.
Jan Kamlah: since 2017 project collaborator and software developer in various DFG-funded OCR projects at Mannheim University Library.
For some years now, GLAM institutions have been working with free or commercial text recognition software to generate full texts from digitized early printed works. Printed works from the 16th to the 19th century in particular pose challenges for OCR solutions: historical fonts (e.g. Fraktur / Blackletter), heterogeneous layouts and typographic peculiarities as well as the broad variety of medium and print quality can impair OCR results. Yet precisely these collections benefit from high-quality full texts: sometimes only a few physical copies of a historical print still exist, or conservation conditions do not permit in situ use. In such cases the digital full text plays an important role in making historical printed works, and thereby an important part of our cultural heritage, accessible to the public.
With the help of a work-specific finetuning workflow, the quality of text recognition can be significantly increased. Furthermore, finetuning an existing OCR model for special characters (e.g. astronomical or mathematical symbols, currency signs etc.) is a time efficient way to improve OCR results for a specific collection of works, as the creation of ground truth for training and the training process itself are quicker than training new models from scratch. Using selected examples and the OCR engine Tesseract, our presentation aims to highlight use cases and general advantages of a finetuning workflow that can be transferred to other OCR engines (e.g. Calamari) as well. We will show different approaches to creating a suitable finetuning workflow which is tailored to the intended use.
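Whether finetuning has increased recognition quality is usually judged by comparing OCR output with ground truth. One common measure is the character error rate (CER): the edit distance between the two strings divided by the length of the ground truth. The abstract does not state which metric the presenters use, so the following is only a generic sketch with an invented Fraktur-style example (long s misread as f).

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,                       # deletion
                    curr[j - 1] + 1,                   # insertion
                    prev[j - 1] + (char_a != char_b),  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def cer(ground_truth, ocr_output):
    """Character error rate of ocr_output relative to ground_truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)
```

For the ground truth "Waſſer" and the OCR output "Waffer", two of six characters are wrong, giving a CER of one third. Comparing this value before and after finetuning on the same held-out lines gives a simple indication of the improvement.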
In the last few years, the open source Tesseract OCR software has significantly improved its accuracy since moving to an LSTM (long short-term memory) based engine. For popular European scripts, very accurate models have been trained, rivaling and at times beating commercial offerings. Accuracy was still lacking for less common scripts such as the Latvian Fraktur script from the 19th to early 20th century.
As with most supervised machine learning models, accurate and plentiful labeling is of paramount importance. To improve accuracy, the author created a pipeline that extracts documents and feeds them into a crowdsourcing website. The newly labeled data was used for supplemental training of existing models. The resulting models improved on older digitization efforts at the National Library of Latvia.
Anyone working with non-trivial metadata can testify that data is inherently dirty. Rules or expectations about data are violated regularly and subtle errors often slip through. Data validation can help to detect deviations from standards but its application requires special knowledge and tools. The presentation will introduce an open source validation service used at VZG to validate data against rules and schemas in different languages and formats (JSON, XML, RDF, MARC…). Motivation, examples, and difficulties of validation are explained and discussed. The service is available at https://format.gbv.de/validate/.
How language technologies could help cultural heritage
Since 2004, Jānis Ziediņš has worked at the Culture Information Systems Centre of the Ministry of Culture (Latvia), currently as Deputy Director for ICT, helping and leading IT projects in the cultural sector. Jānis has been involved in various culture-related IT projects, most importantly the State Unified Information System for Archives, the Joint Catalogue of the National Holdings of Museums, and the development of the Latvian state administration language technology platform Hugo.lv.
Hugo.lv is a Latvian state administration language technology platform that is freely accessible to every resident of Latvia. It provides automatic translation, speech recognition and speech synthesis, as well as a range of tools for supporting multilingual features in e-services. Building on this, the Culture Information Systems Centre has carried out test cases and real-life integrations to expand access to cultural resources through automatic translation, speech recognition and speech synthesis.
SUSTAINABLE LINKED DATA
The Rijksmuseum Integration Layer: a Linked Data Infrastructure for Sustainable Data Services
Chris Dijkshoorn, Edward Anderson
Chris Dijkshoorn is the Head of the Rijksmuseum’s Collection IT department. Edward Anderson is the Data Engineer in the Collection IT department.
Presentation for people interested in the organisational and technical aspects of implementing a Linked Data infrastructure to integrate data from multiple domains to create rich and sustainable data services.
The Rijksmuseum is building a new platform to serve users with rich and interconnected data about its collection, library and archive. A platform for data describing over a million artworks, objects, books and documents charting Dutch art and history. This presentation introduces the Rijksmuseum Integration Layer, its architecture and implementation, and explains how Linked Data principles are enabling us to bridge systems, integrate metadata and deliver sustainable data services.
The Integration Layer is inspired by microservice architectures and the Semantic Web. Linked Data is the key to its design, with identifiers and ontologies providing a foundational shared data structure. At the core of the platform is a knowledge graph served by extract-transform-load pipelines producing and abstracting standards-compliant data models. This data layer is realised by a constellation of loosely-coupled containerised Python applications deployed with Kubernetes in the Azure cloud.
Our Integration Layer is a collection of simple software components underpinned by Web and cultural heritage sector standards. Under the hood it’s really just HTTP, XML, XSLT, RDF, SPARQL, MARC, CIDOC-CRM, RDA, EDM and Dublin Core glued together with minimal application code. The platform isn’t fully built yet, but it is already maturing into a stable and maintainable infrastructure for the future.
The application of artificial intelligence, and in particular deep learning approaches, to the cultural heritage domain has attracted significant attention in recent years. Most existing work focuses on automatic metadata annotation with information such as the author or medium, on image classification by style, topic, etc., or on the objects detected in images from open datasets. The Saint George on a Bike project focuses on generating metadata that is more specific to the cultural heritage domain. First, rich metadata would allow a visitor to a cultural heritage site or the user of a web page to obtain a detailed description of an artwork and would facilitate personalized interaction with GLAM institutions. Second, different types of metadata could be used to automatically generate explanations in catalogs, fuel search and browse engines, or fill in rich alt-text descriptions on websites that cater to minorities such as visually impaired citizens. Generating metadata automatically can save manual annotators a lot of time and labor.
The generation of metadata for paintings or images of cultural heritage objects is challenging compared to that for real-world scenes, for several reasons. First, the metadata for paintings often contain irrelevant information beyond the image content, such as the life of a historical person, information about the place where the object was found, or the life of the painter. The second challenge is the quality of the data and the data collection process. This makes it difficult to train with a dataset similar in size to datasets containing real-life images, such as MS COCO. Lastly, metadata for cultural heritage objects from data providers often contain incomplete sentences or can be in different languages. Data aggregators cannot distinguish such cases when incorporating data; as a result, they end up as part of the metadata and negatively affect the quality of the datasets.
The goal of Saint George on a Bike is to provide rich information about European cultural heritage pictorial artwork. More than one type of output may be generated, which fundamentally depends on the type of input available. The levels of semantic output that we currently contemplate are the following: semantic resources in the form of tags coming from existing vocabularies, and textual captions. This presentation will present the solutions and resources we have designed and implemented, which can generate textual tags, captions and additional metadata that can be useful for GLAM organizations. We have identified the controlled vocabulary from which to choose semantic tags based on the Europeana Entity Collection tags, and we are in the process of (1) refining this vocabulary and (2) considering related sources such as DBpedia, Wikidata, or more specific vocabularies used by Europeana providers. We explain the system and module-level architecture for each of these techniques, as well as their implementation.
As we move through the digital age, certain areas of the library are becoming less and less discoverable. In 2021, Lean Library published the Librarian Futures report, with various findings on user-centricity. Modern patron workflows now begin outside the library, with 79% of faculty and 74% of students beginning discovery outside of the library’s tools, on websites such as Google Scholar. Modern consumers are used to ‘point of need’ information, getting the content and information where and when they need it, rather than having to leave their workflow and look for it elsewhere.
The report also uncovered the lack of awareness of the full extent of library services available to patrons; when students were asked what sources of information they used the most, the library was used the same amount as Wikipedia. This disconnect between the library services available and patron usage may be due to the large number of libraries that have not yet embedded their services around the workflow of their users.
This presentation will include a summary of key findings from the Librarian Futures report, along with an overview of how Lean Library increases discoverability of library resources and services, ensuring collections remain easily accessible to patrons in their workflow, and fostering an open, collaborative community. Lean Library continues the transition of the library into a digital space, bridging the gap between the physical and digital collections, and bringing specialist knowledge to the wider library community.