    Chair of Computer Science VI - Artificial Intelligence and Applied Computer Science

    Layout Analysis and Region EXtraction (LAREX)

    LAREX is a semi-automatic open-source tool for layout analysis on early printed books. It uses a rule-based connected-components approach that is very fast, easy for the user to understand, and allows intuitive manual correction where necessary. The PageXML format is used to support integration into existing OCR workflows. Evaluations showed that LAREX provides an efficient and flexible way to segment pages of early printed books.
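To make the idea of a rule-based connected-components approach more concrete, the following is a minimal sketch in Python. It is not LAREX's actual implementation: the function names, the toy binary page, and the single area rule (`min_image_area`) are all assumptions chosen for illustration. The sketch groups touching foreground pixels into components, computes a bounding box for each, and then applies one simple rule to label each box.

```python
from collections import deque


def connected_components(grid):
    """Return bounding boxes (top, left, bottom, right) of 8-connected
    foreground regions in a binary grid (1 = ink, 0 = background)."""
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search over all pixels touching this seed pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def classify(box, min_image_area=6):
    """Hypothetical rule: large bounding boxes are images, small ones text."""
    top, left, bottom, right = box
    area = (bottom - top + 1) * (right - left + 1)
    return "image" if area >= min_image_area else "text"


# Toy binary page (an assumption for illustration, not real scan data).
PAGE = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
]

boxes = connected_components(PAGE)
labels = [classify(box) for box in boxes]
```

A real system would combine several such rules (size, position, aspect ratio) and merge neighbouring components into regions; the single threshold here only shows why the approach is fast and easy to follow.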

    Goal

    While a straightforward text/non-text separation might be sufficient to ensure a high-quality subsequent OCR result, many users require an even finer-grained segmentation, including a precise semantic classification of the detected text segments.

    The figure below shows an example input image and the corresponding desired segmentation output, including images and a swash capital (green) as well as running text (red), heading (blue), marginalia (yellow) and sheet title (cyan) elements.

    Recent Applications and Results

    In an effort to support the Narragonien digital project, three editions of the Ship of Fools were segmented using LAREX. For all three books, a detailed semantic classification similar to the one shown above was required. The segmentation was performed by a student research assistant of the Narragonien digital project who had some prior experience with LAREX but no background in computer science, image processing or layout analysis. On average, the segmentation of one book comprising ~300 pages took less than 4 hours. The subsequent OCR yielded excellent recognition results at both the character level (above 98%) and the word level (above 92%), indicating the precision of the obtained segmentation.

    Current Status and Contact

    In the last few weeks, a new browser-based GUI has been built from scratch. This also required some substantial changes within the rest of the code. We are now working on a step-by-step adaptation and integration of the existing functionality. Nevertheless, feel free to start testing right away, but please keep in mind that it is a work in progress. The code is available on GitHub and a short user manual will be released within the next few days.

    The Web Demo is available here. There are some known bugs when using browsers other than Chrome. These issues will be addressed as soon as possible. A short user manual can be downloaded here.

    If you have any questions or suggestions, please feel free to contact me at
    christian.reul@uni-wuerzburg.de.

    Related Publications:

    Reul, C., Springmann, U., and Puppe, F.: LAREX - A semi-automatic open-source Tool for Layout Analysis and Region Extraction on Early Printed Books.
    Accepted for oral presentation at DATeCH 2017. Draft available at arXiv.

    Reul, C., Dittrich, M., and Gruner, M.: Case Study of a highly automated Layout Analysis and OCR of an incunabulum: 'Der Heiligen Leben' (1488).
    Accepted for oral presentation at DATeCH 2017. Draft available at arXiv.

    Contact

    Lehrstuhl für Informatik VI (Künstliche Intelligenz und angewandte Informatik)
    Am Hubland
    97074 Würzburg

    Phone: +49 931 31-86731
    Location: Hubland Süd, building M2