
Content-based annotation of page images from the (archaeological) historical archive

Please use the following text to cite this item or export to a predefined format:
Lutsai, Kateryna and Křivánková, Dana, 2025, Content-based annotation of page images from the (archaeological) historical archive, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/20.500.12800/1-5959.
Date issued
2025-10-10
Size
91 GB,
48499 images,
11 categories
Description
This dataset uses an 11-label classification scheme to categorize scanned images of document pages by their content and presentation format. The scheme distinguishes visual content (drawings, maps, paintings, schematics, and photographs), textual content (handwritten, printed, or machine-typed), and hybrid formats that combine several elements.

Layout is labeled separately: content presented in tabular or form-like structures receives a different label from content in paragraph or block format. For instance, standard drawings (DRAW📈) are distinguished from drawings with table-based legends (DRAW_L📈📏), and regular photographs (PHOTO🌄) from those embedded within tabular layouts (PHOTO_L🌄📏).

The textual categories distinguish three input methods, handwritten (✏️), printed (📄), and machine-typed (📄), and subdivide each by structural organization: tabular or form-like arrangements (LINE_HW, LINE_P, LINE_T) versus traditional paragraph or block formats (TEXT_HW, TEXT_P, TEXT_T). An additional TEXT📰 category covers mixed documents that combine several text types or include minor graphical elements, which accommodates complex real-world documents.

The dataset is organized for 5-fold cross-validation, with each fold split 80-10-10 into training, development, and test sets. The fold assignments are documented in an accompanying CSV file. This supports robust model evaluation and allows ensembling: models trained on different folds can be averaged into a more robust combined model, provided they share the same base architecture.
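The label scheme described above can be written down as a small lookup table. The following is a minimal Python sketch, not part of the dataset itself: the `LabelInfo` fields and the `labels_where` helper are hypothetical names introduced for illustration, the reading of the suffixes `HW`, `P`, and `T` as handwritten, printed, and machine-typed is inferred from the order in which the description lists them, and the `"drawing"` kind is a simplification that stands in for all non-photographic visual content.

```python
from typing import NamedTuple


class LabelInfo(NamedTuple):
    kind: str      # "drawing", "photo", or "text" (simplified grouping)
    medium: str    # text input method; empty string for visual labels
    tabular: bool  # True when the page has a table- or form-like layout


# The 11 labels named in the description. The attribute values are an
# interpretation of that description, not metadata shipped with the dataset.
LABELS = {
    "DRAW": LabelInfo("drawing", "", False),
    "DRAW_L": LabelInfo("drawing", "", True),
    "PHOTO": LabelInfo("photo", "", False),
    "PHOTO_L": LabelInfo("photo", "", True),
    "LINE_HW": LabelInfo("text", "handwritten", True),
    "LINE_P": LabelInfo("text", "printed", True),
    "LINE_T": LabelInfo("text", "machine-typed", True),
    "TEXT_HW": LabelInfo("text", "handwritten", False),
    "TEXT_P": LabelInfo("text", "printed", False),
    "TEXT_T": LabelInfo("text", "machine-typed", False),
    "TEXT": LabelInfo("text", "mixed", False),
}


def labels_where(**criteria):
    """Return the sorted label names whose attributes match all criteria."""
    return sorted(
        name
        for name, info in LABELS.items()
        if all(getattr(info, key) == value for key, value in criteria.items())
    )
```

For example, `labels_where(tabular=True)` selects the five labels with a table- or form-like layout, and `labels_where(kind="text", medium="handwritten")` selects the two handwritten text labels.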
This item is Publicly Available
and licensed under:
 Files in this item
| Name | Size | Format | Description | MD5 |
|---|---|---|---|---|
| DATASET_FOLDS_ANNOTATION.csv | 2.32 MB | text/csv | CSV | de0ce1f9da169a8fed47a34e835e952f |
| stacked_timeline_graph.png | 183.01 KB | image/png | PNG | b2151ac48e06573e0454adb49ca108b8 |
| DRAW.zip | 8.56 GB | application/zip | Zip | 13c11b982458c04e081d600c5bd54d60 |
| DRAW_L.zip | 4.87 GB | application/zip | Zip | f0a63ee10767b404e3c03c8556614b64 |
| LINE_HW.zip | 5.18 GB | application/zip | Zip | ca833a3012ceab5db167b0a2910382f4 |
| LINE_P.zip | 1.46 GB | application/zip | Zip | 2e3afc3d97e1917a91c3484e48aa6514 |
| README.md | 5.32 KB | application/octet-stream | Unknown | f36452da35f38f7e8f814d015e8236c7 |
| PHOTO.zip | 7.3 GB | application/zip | Zip | 8c8cc51c29a7c8e955e3838c8d339a55 |
| TEXT_HW.zip | 4.34 GB | application/zip | Zip | 64ab17485a0b55bec7561586ee384b9f |
| TEXT_T.zip | 5.3 GB | application/zip | Zip | 840daff9531120928660d5339fe60b42 |
| TEXT_P.zip | 2.31 GB | application/zip | Zip | b89110d4c0bf2af17aee5ee1c971832c |
| LINE_T-1.zip | 6.77 GB | application/zip | Zip | 52182180ae95044e2e7fd343bdaf4cbf |
| LINE_T-2.zip | 5.19 GB | application/zip | Zip | 1ed6f84b49821847afd7b1403d75da1c |
| PHOTO_L.zip | 8.06 GB | application/zip | Zip | 48638fd039a6832cf2aed9ef4de954c6 |
| TEXT-1.zip | 8.68 GB | application/zip | Zip | 15a2498977f152526304605d00151131 |
| TEXT-2.zip | 8.68 GB | application/zip | Zip | 0ebe7899f037a2751c7e113dfb275ff3 |
| TEXT-3.zip | 8.55 GB | application/zip | Zip | e1e5d42ce6e732a769a64dc65444e1dc |
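Since every file in this item is published with an MD5 checksum, a download can be checked against the listing before unpacking. The following is a minimal Python sketch using only the standard library. The function names `md5_of_file`, `verify`, and `check_downloads` are hypothetical, the download directory is whatever local path the archives were saved to, and `EXPECTED_MD5` copies only three of the listed checksums as an example.

```python
import hashlib
import os

# A few checksums copied from the file listing; extend with the other files as needed.
EXPECTED_MD5 = {
    "DATASET_FOLDS_ANNOTATION.csv": "de0ce1f9da169a8fed47a34e835e952f",
    "LINE_P.zip": "2e3afc3d97e1917a91c3484e48aa6514",
    "TEXT_P.zip": "b89110d4c0bf2af17aee5ee1c971832c",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True when the file's digest matches the expected checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()


def check_downloads(directory, expected=EXPECTED_MD5):
    """Map each expected file name to 'ok', 'mismatch', or 'missing'."""
    report = {}
    for name, checksum in expected.items():
        path = os.path.join(directory, name)
        if not os.path.exists(path):
            report[name] = "missing"
        else:
            report[name] = "ok" if verify(path, checksum) else "mismatch"
    return report
```

Reading in chunks keeps memory use flat even for the multi-gigabyte archives. A result of `"mismatch"` usually points to an incomplete or corrupted download, so the file should be fetched again before use.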