Content-based annotation of page images from the (archaeological) historical archive
Please use the following text to cite this item:
Lutsai, Kateryna and Křivánková, Dana, 2025,
Content-based annotation of page images from the (archaeological) historical archive, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/20.500.12800/1-5959.
Authors
Kateryna Lutsai, Dana Křivánková
Item identifier
http://hdl.handle.net/20.500.12800/1-5959
Date issued
2025-10-10
Size
91 GB,
48499 images,
11 categories
Description
This dataset uses an 11-label classification scheme to categorize scanned images of document pages by content and presentation format. The scheme distinguishes visual content (drawings, maps, paintings, schematics, and photographs), textual content (handwritten, printed, or machine-typed), and hybrid formats that combine multiple elements. Layout gets separate labels: content presented in tabular or form-like structures is labeled apart from content in paragraph or block formats. For instance, standard drawings (DRAW📈) are distinguished from drawings with table-based legends (DRAW_L📈📏), and regular photographs (PHOTO🌄) from those embedded within tabular layouts (PHOTO_L🌄📏).
The textual categories distinguish three input methods, handwritten (✏️), printed (📄), and machine-typed (📄), and subdivide each by structural organization. Text can appear in tabular or form-like arrangements (LINE_HW, LINE_P, LINE_T) or in traditional paragraph or block formats (TEXT_HW, TEXT_P, TEXT_T). An additional TEXT📰 category covers mixed documents that combine multiple text types or include minor graphical elements, which accommodates complex real-world documents.
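The label scheme described above can be summarized as a small lookup structure. The following is only an illustrative sketch: the `PageLabel` class, its field names, and the `content`/`layout` values are hypothetical, and the reading of the suffixes (`_HW` handwritten, `_P` printed, `_T` machine-typed, `_L` tabular legend or layout) is inferred from the description rather than taken from the dataset files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageLabel:
    name: str     # label as named in the description, e.g. "LINE_HW"
    content: str  # main content type (hypothetical vocabulary)
    layout: str   # "tabular", "block", or "mixed" (hypothetical vocabulary)


# The 11 categories, grouped as the description groups them.
LABELS = [
    PageLabel("DRAW", "drawing", "block"),
    PageLabel("DRAW_L", "drawing", "tabular"),
    PageLabel("PHOTO", "photo", "block"),
    PageLabel("PHOTO_L", "photo", "tabular"),
    PageLabel("LINE_HW", "handwritten", "tabular"),
    PageLabel("LINE_P", "printed", "tabular"),
    PageLabel("LINE_T", "typed", "tabular"),
    PageLabel("TEXT_HW", "handwritten", "block"),
    PageLabel("TEXT_P", "printed", "block"),
    PageLabel("TEXT_T", "typed", "block"),
    PageLabel("TEXT", "mixed", "mixed"),
]


def labels_with_layout(layout):
    """Return the label names that share the given layout, in list order."""
    return [label.name for label in LABELS if label.layout == layout]
```

Calling `labels_with_layout("tabular")` lists the five categories reserved for table-like or form-like pages, which is the distinction the scheme emphasizes.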
The dataset is organized for 5-fold cross-validation, with each fold keeping an 80-10-10 split into training, development, and test sets. The fold assignments are recorded in the accompanying CSV file (DATASET_FOLDS_ANNOTATION.csv). This supports robust model evaluation and allows ensembling: models trained on different folds can be averaged into a more robust combined model, provided they share the same base architecture.
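To make the fold structure and the averaging idea concrete, here is a minimal sketch under stated assumptions. It does not reproduce how the published folds were built (the CSV holds the actual assignment, and its column layout is not described here). The function names `make_folds` and `average_weights` are hypothetical, the split strategy (ten equal parts, two held out per fold) is one possible way to get 80-10-10, and "averaged" is interpreted as parameter averaging over plain lists of floats.

```python
import random


def make_folds(item_ids, n_folds=5, seed=0):
    """Build n_folds splits of roughly 80/10/10 (train/dev/test).

    Assumption: items are shuffled and cut into 2 * n_folds parts;
    fold k holds out part 2k as test and part 2k + 1 as dev.
    """
    ids = list(item_ids)
    random.Random(seed).shuffle(ids)
    n_parts = 2 * n_folds
    parts = [ids[i::n_parts] for i in range(n_parts)]
    folds = []
    for k in range(n_folds):
        held_out = (2 * k, 2 * k + 1)
        train = [x for j, part in enumerate(parts) if j not in held_out for x in part]
        folds.append({"train": train, "dev": parts[2 * k + 1], "test": parts[2 * k]})
    return folds


def average_weights(state_dicts):
    """Average parameters of models that share one architecture.

    Each model is a dict mapping a parameter name to a flat list of floats.
    """
    first = state_dicts[0]
    for other in state_dicts[1:]:
        same_names = other.keys() == first.keys()
        if not same_names or any(len(other[k]) != len(first[k]) for k in first):
            raise ValueError("models must share the same base architecture")
    n = len(state_dicts)
    return {
        name: [sum(sd[name][i] for sd in state_dicts) / n for i in range(len(values))]
        for name, values in first.items()
    }
```

The shape check in `average_weights` mirrors the condition stated above: averaging only makes sense when every model has the same parameters with the same sizes.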
Acknowledgement
European Commission
Project code: EC/HORIZON-RIA/101132163/EU
Project name: Advancing FronTier Research In the Arts and hUManities
Files in this item
- Name
- DATASET_FOLDS_ANNOTATION.csv
- Size
- 2.32 MB
- Format
- text/csv
- Description
- CSV
- MD5
- de0ce1f9da169a8fed47a34e835e952f

- Name
- stacked_timeline_graph.png
- Size
- 183.01 KB
- Format
- image/png
- Description
- PNG
- MD5
- b2151ac48e06573e0454adb49ca108b8

- Name
- DRAW.zip
- Size
- 8.56 GB
- Format
- application/zip
- Description
- Zip
- MD5
- 13c11b982458c04e081d600c5bd54d60

- Name
- DRAW_L.zip
- Size
- 4.87 GB
- Format
- application/zip
- Description
- Zip
- MD5
- f0a63ee10767b404e3c03c8556614b64

- Name
- LINE_HW.zip
- Size
- 5.18 GB
- Format
- application/zip
- Description
- Zip
- MD5
- ca833a3012ceab5db167b0a2910382f4

- Name
- LINE_P.zip
- Size
- 1.46 GB
- Format
- application/zip
- Description
- Zip
- MD5
- 2e3afc3d97e1917a91c3484e48aa6514

- Name
- README.md
- Size
- 5.32 KB
- Format
- application/octet-stream
- Description
- Unknown
- MD5
- f36452da35f38f7e8f814d015e8236c7

- Name
- PHOTO.zip
- Size
- 7.3 GB
- Format
- application/zip
- Description
- Zip
- MD5
- 8c8cc51c29a7c8e955e3838c8d339a55

- Name
- TEXT_HW.zip
- Size
- 4.34 GB
- Format
- application/zip
- Description
- Zip
- MD5
- 64ab17485a0b55bec7561586ee384b9f

- Name
- TEXT_T.zip
- Size
- 5.3 GB
- Format
- application/zip
- Description
- Zip
- MD5
- 840daff9531120928660d5339fe60b42

- Name
- TEXT_P.zip
- Size
- 2.31 GB
- Format
- application/zip
- Description
- Zip
- MD5
- b89110d4c0bf2af17aee5ee1c971832c

- Name
- LINE_T-1.zip
- Size
- 6.77 GB
- Format
- application/zip
- Description
- Zip
- MD5
- 52182180ae95044e2e7fd343bdaf4cbf

- Name
- LINE_T-2.zip
- Size
- 5.19 GB
- Format
- application/zip
- Description
- Zip
- MD5
- 1ed6f84b49821847afd7b1403d75da1c

- Name
- PHOTO_L.zip
- Size
- 8.06 GB
- Format
- application/zip
- Description
- Zip
- MD5
- 48638fd039a6832cf2aed9ef4de954c6

- Name
- TEXT-1.zip
- Size
- 8.68 GB
- Format
- application/zip
- Description
- Zip
- MD5
- 15a2498977f152526304605d00151131

- Name
- TEXT-2.zip
- Size
- 8.68 GB
- Format
- application/zip
- Description
- Zip
- MD5
- 0ebe7899f037a2751c7e113dfb275ff3

- Name
- TEXT-3.zip
- Size
- 8.55 GB
- Format
- application/zip
- Description
- Zip
- MD5
- e1e5d42ce6e732a769a64dc65444e1dc



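Each file above is listed with an MD5 checksum, so a download can be checked before unpacking. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `matches_listed_md5` are hypothetical, and the file is assumed to be already downloaded to a local path.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, expected_hex):
    """Compare a local file against a checksum copied from the file list."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

For example, with `DRAW.zip` saved in the working directory, `matches_listed_md5("DRAW.zip", "13c11b982458c04e081d600c5bd54d60")` should return `True` if the download is intact.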