KALLIMACHOS unites humanities scholars, computer scientists and librarians in a regional digital humanities center. The cooperations and competences already present at the University of Würzburg are complemented by partners at the DFKI Kaiserslautern (OCR) and at the University of Erlangen-Nürnberg (computational linguistics). The establishment of the center is funded by the Federal Ministry of Education and Research (Bundesministerium für Bildung und Forschung) as part of its e-Humanities funding programme until the third quarter of 2017.
Task: Information extraction from German novels
Automatically extracted social networks for Goethe's “Die Wahlverwandtschaften”. The left picture shows the ten most connected characters when an interaction is created for each common appearance in a paragraph. The right picture shows the corresponding network when only direct speech counts as an interaction.
Our aim is to extract attributed character networks from historic German novels. This task requires a considerable amount of preprocessing of the novel text, including:
- Tokenizing and sentence detection
- Part of Speech Tagging
- Named Entity Detection
- Coreference Resolution
- Relation Extraction
Our current work is mainly focused on the last three steps of this pipeline architecture. The technologies used range from classical rule-based approaches to machine learning, including the much-hyped deep learning.
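The paragraph-based network described in the figure caption can be illustrated with a minimal sketch. It assumes that character mentions have already been identified; plain substring matching stands in here for named entity detection and coreference resolution, and the function names and toy paragraphs are hypothetical, not part of the project's code.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(paragraphs, characters):
    """Count, for each pair of characters, the paragraphs that mention both."""
    edges = Counter()
    for paragraph in paragraphs:
        # Naive substring lookup; a real pipeline would use NER and coreference output.
        present = sorted(name for name in characters if name in paragraph)
        for pair in combinations(present, 2):
            edges[pair] += 1
    return edges


def most_connected(edges, top_n=10):
    """Rank characters by weighted degree (sum of their edge weights)."""
    degree = Counter()
    for (a, b), weight in edges.items():
        degree[a] += weight
        degree[b] += weight
    return [name for name, _ in degree.most_common(top_n)]


# Toy paragraphs, invented for illustration (not text from the novel).
paragraphs = [
    "Eduard spoke with Charlotte in the garden.",
    "Charlotte wrote to Ottilie.",
    "Eduard and Charlotte welcomed the Hauptmann.",
]
characters = ["Eduard", "Charlotte", "Ottilie", "Hauptmann"]
network = cooccurrence_network(paragraphs, characters)
```

Restricting the input to passages of direct speech, instead of whole paragraphs, would yield the second kind of network under the same counting scheme.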
Task: Layout Analysis of (medieval) Prints
The digitization of many medieval prints like Sebastian Brant's "Ship of Fools" is a key part of the KALLIMACHOS project. To optimize the results achieved by OCR (Optical Character Recognition) methods on a page, a preceding analysis of the layout is needed. As shown below (left: original image, right: segmentation result), the pages of medieval prints can contain much more than just plain text (red). Images and initials (both green) occur frequently, as does additional information like marginalia (yellow), page numbers (purple), headers and footers, or ornaments like borders.
The complexity and variety of page layouts often cause standard, fully automated open source segmentation tools to fail. Therefore, we apply a user-assisted approach based on connected component analysis. During the initialization phase, some parameters have to be optimized by segmenting several representative pages and adjusting them accordingly. As soon as satisfactory segmentation results are achieved, the remainder of the current book is segmented automatically. The results are stored using the PageXML standard.
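As a rough illustration of connected component analysis, the following sketch labels 4-connected foreground regions in a binarized page and classifies each region with a single size threshold. The threshold `max_text_height` is a hypothetical stand-in for the kind of parameter a user might tune on representative pages; the features and rules actually used in the project are not described here.

```python
from collections import deque


def connected_components(binary):
    """Return bounding boxes (top, left, bottom, right) of 4-connected foreground regions."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and binary[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def classify(box, max_text_height):
    """Hypothetical rule: components taller than the threshold are image candidates."""
    top, _, bottom, _ = box
    return "image" if bottom - top + 1 > max_text_height else "text"


# Tiny binarized page (1 = ink), invented for illustration.
page = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 1],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0],
]
boxes = connected_components(page)
labels = [classify(box, max_text_height=2) for box in boxes]
```

In a user-assisted workflow of this kind, the threshold would be adjusted until the labels on the sample pages look right, and then reused unchanged for the rest of the book.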