OCR Dataset Corpus

This corpus is created from the Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Documents, as available from the LINDAT repository.

The OCR dataset consists of individual pages of text, transcribed using OCR and made available as an image, an hOCR file, raw text, and JSON data. Each folder corresponds to a book.

For this corpus, all hOCR files were converted to TEI/XML, with a single XML file per book. The high-resolution TIFF images available in the dataset were converted to JPG, since most web browsers cannot display TIFF, and downscaled to the size that was used for the OCR process. Each book was then part-of-speech tagged and parsed using UDPipe, and annotated with named entities using NameTag 2.
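
To illustrate the kind of hOCR to TEI/XML conversion described above, here is a minimal Python sketch. It is not the conversion actually used for this corpus. It assumes the hOCR pages are well-formed XHTML using the standard ocr_line and ocrx_word classes, and the mapping to pb, lb, and w elements, the bbox attribute, the function names, and the sample text are all hypothetical choices made for illustration. The result has no teiHeader, so it is not a schema-valid TEI document.

```python
import re
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
TEI_NS = "http://www.tei-c.org/ns/1.0"
TEI = "{" + TEI_NS + "}"

# Invented sample page; real hOCR files in the dataset will be larger.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="ocr_page" title="bbox 0 0 1000 1500">
<span class="ocr_line" title="bbox 10 10 500 40">
<span class="ocrx_word" title="bbox 10 10 100 40">In</span>
<span class="ocrx_word" title="bbox 110 10 300 40">nomine</span>
</span></div></body></html>"""


def bbox_of(element):
    """Return the 'x0 y0 x1 y1' part of an hOCR title attribute, or None."""
    match = re.search(r"bbox (\d+ \d+ \d+ \d+)", element.get("title", ""))
    return match.group(1) if match else None


def hocr_to_tei(hocr_pages):
    """Merge a list of hOCR page strings (one book) into one TEI-like tree."""
    tei = ET.Element(TEI + "TEI")
    text = ET.SubElement(tei, TEI + "text")
    body = ET.SubElement(text, TEI + "body")
    for number, hocr in enumerate(hocr_pages, start=1):
        root = ET.fromstring(hocr)
        # One page break per hOCR file (assumed mapping).
        ET.SubElement(body, TEI + "pb", n=str(number))
        for line in root.iter(XHTML + "span"):
            if line.get("class") != "ocr_line":
                continue
            ET.SubElement(body, TEI + "lb")
            for word in line.iter(XHTML + "span"):
                if word.get("class") != "ocrx_word":
                    continue
                w = ET.SubElement(body, TEI + "w")
                w.text = "".join(word.itertext()).strip()
                bbox = bbox_of(word)
                if bbox:
                    # Hypothetical attribute keeping the word's position.
                    w.set("bbox", bbox)
    return tei


# Serialize with TEI as the default namespace.
ET.register_namespace("", TEI_NS)
tei = hocr_to_tei([SAMPLE])
print(ET.tostring(tei, encoding="unicode"))
```

Passing all hOCR pages of one folder to hocr_to_tei in page order would yield a single tree per book, mirroring the one-file-per-book layout described above.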