A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents

Please use the following text to cite this item:
Novotný, Vít; Luger, Kristýna; Štefánik, Michal; Vrabcová, Tereza and Horák, Aleš, 2022, A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11234/1-4936.
Date issued
2022-11-30
Size
173 files, 1.58 GB
Description
This is an open dataset of sentences from 19th- and 20th-century letterpress reprints of documents from the Hussite era. The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).
This item is publicly available.
Files in this item
Name
language-modeling-corpus.zip
Size
633.79 MB
Format
application/zip
Description
Zip
MD5
b6ed0a8e7dc263d1a1e635d6d5770f6b
File preview
    • dataset_mlm_non-crossing_only-relevant_training.txt (6 MB)
    • dataset_mlm_non-crossing_all_validation.txt (64 MB)
    • dataset_mlm_all_only-relevant_training.txt (7 MB)
    • dataset_mlm_all_only-relevant_validation.txt (719 kB)
    • dataset_mlm_all_all_validation.txt (78 MB)
    • dataset_mlm_non-crossing_all_training.txt (499 MB)
    • dataset_mlm_non-crossing_only-relevant_validation.txt (536 kB)
    • dataset_mlm_all_all_training.txt (601 MB)
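The MD5 value listed for language-modeling-corpus.zip can be used to check that a download completed intact. A minimal Python sketch, assuming the archive was saved under its original name in the current directory (the function name md5_of_file and the local path are illustrative and not part of the dataset):

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat for the ~634 MB archive.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# MD5 value copied from the file listing on this page.
EXPECTED_MD5 = "b6ed0a8e7dc263d1a1e635d6d5770f6b"

if __name__ == "__main__":
    # Assumed local download location; adjust as needed.
    archive = Path("language-modeling-corpus.zip")
    if archive.exists():
        actual = md5_of_file(archive)
        print("OK" if actual == EXPECTED_MD5 else f"MISMATCH: {actual}")
    else:
        print(f"{archive} not found; download it first")
```

The same function can be pointed at named-entity-recognition-annotations.zip with the MD5 value listed for that file.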
Name
named-entity-recognition-annotations.zip
Size
978.29 MB
Format
application/zip
Description
Zip
MD5
92d80ec8d6e66263295797b1e81bd60d
File preview
    • dataset_ner_manatee_non-crossing_only-relevant_testing_001-400.ner_tags.txt (41 kB)
    • dataset_ner_manatee+regests_all_only-relevant_validation_automatically_tagged.sentences.txt (628 kB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_all_validation_automatically_tagged.ner_tags.txt (10 MB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_training_automatically_tagged.sentences.txt (3 MB)
    • dataset_ner_regests_testing.sentences.txt (186 kB)
    • dataset_ner_fuzzy-regex+regests_all_all_training.ner_tags.txt (50 MB)
    • dataset_ner_manatee_non-crossing_all_validation.ner_tags.txt (2 MB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_validation_automatically_tagged.ner_tags.txt (317 kB)
    • dataset_ner_fuzzy-regex_non-crossing_all_validation_automatically_tagged.ner_tags.txt (10 MB)
    • dataset_ner_manatee+regests_non-crossing_all_training.sentences.txt (43 MB)
    • dataset_ner_manatee_non-crossing_only-relevant_training_automatically_tagged.sentences.txt (1 MB)
    • dataset_ner_fuzzy-regex+regests_all_only-relevant_training.sentences.txt (4 MB)
    • dataset_ner_manatee+regests_all_only-relevant_training_automatically_tagged.sentences.txt (3 MB)
    • dataset_ner_manatee_non-crossing_only-relevant_training.ner_tags.txt (546 kB)
    • dataset_ner_manatee_non-crossing_all_validation_automatically_tagged.ner_tags.txt (3 MB)
    • dataset_ner_manatee+regests_all_all_training_automatically_tagged.sentences.txt (64 MB)
    • dataset_ner_fuzzy-regex+regests_all_all_training_automatically_tagged.ner_tags.txt (63 MB)
    • dataset_ner_manatee+regests_all_all_training.sentences.txt (64 MB)
    • dataset_ner_manatee_all_all_validation.sentences.txt (14 MB)
    • dataset_ner_fuzzy-regex_all_only-relevant_validation_automatically_tagged.sentences.txt (908 kB)
    • dataset_ner_manatee+regests_non-crossing_only-relevant_training.ner_tags.txt (956 kB)
    • dataset_ner_fuzzy-regex_all_only-relevant_validation_automatically_tagged.ner_tags.txt (394 kB)
    • dataset_ner_fuzzy-regex+regests_all_only-relevant_validation.ner_tags.txt (347 kB)
    • dataset_ner_fuzzy-regex+regests_all_all_training.sentences.txt (156 MB)
    • dataset_ner_regests_testing_001-400.sentences.txt (91 kB)
    • dataset_ner_manatee+regests_non-crossing_only-relevant_training_automatically_tagged.ner_tags.txt (1 MB)
    • dataset_ner_fuzzy-regex_all_all_testing.ner_tags.txt (12 MB)
    • dataset_ner_manatee_non-crossing_only-relevant_testing.ner_tags.txt (114 kB)
    • dataset_ner_fuzzy-regex_non-crossing_all_testing.ner_tags.txt (8 MB)
    • dataset_ner_regests_testing.ner_tags.txt (69 kB)
    • dataset_ner_manatee_non-crossing_all_validation_automatically_tagged.sentences.txt (8 MB)
    • dataset_ner_manatee_non-crossing_all_training.ner_tags.txt (13 MB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_training.sentences.txt (3 MB)
    • dataset_ner_fuzzy-regex+regests_all_all_validation_automatically_tagged.sentences.txt (39 MB)
    • dataset_ner_manatee_non-crossing_only-relevant_testing_401-500.ner_tags.txt (9 kB)
    • dataset_ner_manatee+regests_all_all_validation_automatically_tagged.ner_tags.txt (5 MB)
    • dataset_ner_fuzzy-regex_all_only-relevant_training.sentences.txt (3 MB)
    • dataset_ner_manatee_all_only-relevant_testing.ner_tags.txt (164 kB)
    • dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_automatically_tagged.docx (58 kB)
    • dataset_ner_fuzzy-regex_non-crossing_all_training.ner_tags.txt (36 MB)
    • dataset_ner_fuzzy-regex+regests_all_only-relevant_training_automatically_tagged.ner_tags.txt (1 MB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_validation_automatically_tagged.sentences.txt (725 kB)
    • dataset_ner_fuzzy-regex+regests_all_only-relevant_training.ner_tags.txt (1 MB)
    • dataset_ner_manatee_all_all_training_automatically_tagged.ner_tags.txt (26 MB)
    • dataset_ner_manatee+regests_all_all_validation_automatically_tagged.sentences.txt (14 MB)
    • dataset_ner_manatee_all_only-relevant_validation_automatically_tagged.ner_tags.txt (212 kB)
    • dataset_ner_fuzzy-regex_all_only-relevant_testing.ner_tags.txt (308 kB)
    • dataset_ner_manatee+regests_non-crossing_all_training.ner_tags.txt (13 MB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_all_training_automatically_tagged.ner_tags.txt (46 MB)
    • dataset_ner_manatee+regests_non-crossing_all_training_automatically_tagged.sentences.txt (43 MB)
    • dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_tagged.ner_tags.txt (12 kB)
    • dataset_ner_manatee+regests_non-crossing_all_validation.ner_tags.txt (2 MB)
    • dataset_ner_regests_validation.ner_tags.txt (49 kB)
    • dataset_ner_manatee_all_only-relevant_training_automatically_tagged.ner_tags.txt (899 kB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_all_validation_automatically_tagged.sentences.txt (25 MB)
    • dataset_ner_fuzzy-regex_all_only-relevant_validation.sentences.txt (908 kB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_all_validation.sentences.txt (25 MB)
    • dataset_ner_manatee_all_all_testing.sentences.txt (14 MB)
    • dataset_ner_fuzzy-regex_all_all_training.ner_tags.txt (51 MB)
    • dataset_ner_fuzzy-regex_all_all_validation_automatically_tagged.ner_tags.txt (16 MB)
    • dataset_ner_fuzzy-regex_non-crossing_only-relevant_validation_automatically_tagged.ner_tags.txt (266 kB)
    • dataset_ner_manatee_all_only-relevant_validation.sentences.txt (500 kB)
    • dataset_ner_manatee_all_only-relevant_training.sentences.txt (2 MB)
    • dataset_ner_manatee_non-crossing_only-relevant_testing.sentences.txt (343 kB)
    • dataset_ner_fuzzy-regex_non-crossing_only-relevant_validation_automatically_tagged.sentences.txt (597 kB)
    • dataset_ner_fuzzy-regex_non-crossing_only-relevant_validation.ner_tags.txt (201 kB)
    • dataset_ner_fuzzy-regex+regests_all_all_validation.sentences.txt (39 MB)
    • dataset_ner_fuzzy-regex+regests_all_only-relevant_training_automatically_tagged.sentences.txt (4 MB)
    • dataset_ner_manatee_non-crossing_all_testing.sentences.txt (8 MB)
    • dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_automatically_tagged.ner_tags.txt (11 kB)
    • dataset_ner_fuzzy-regex_all_only-relevant_training_automatically_tagged.sentences.txt (3 MB)
    • dataset_ner_manatee_non-crossing_only-relevant_validation.sentences.txt (340 kB)
    • dataset_ner_fuzzy-regex_non-crossing_all_testing.sentences.txt (25 MB)
    • dataset_ner_fuzzy-regex+regests_all_only-relevant_validation_automatically_tagged.ner_tags.txt (445 kB)
    • dataset_ner_manatee+regests_non-crossing_only-relevant_validation.ner_tags.txt (162 kB)
    • dataset_ner_fuzzy-regex_all_only-relevant_training.ner_tags.txt (1 MB)
    • dataset_ner_manatee_all_only-relevant_training_automatically_tagged.sentences.txt (2 MB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_all_training.sentences.txt (110 MB)
    • dataset_ner_regests_validation.sentences.txt (128 kB)
    • dataset_ner_fuzzy-regex_non-crossing_only-relevant_validation.sentences.txt (598 kB)
    • dataset_ner_fuzzy-regex+regests_all_only-relevant_validation.sentences.txt (1 MB)
    • dataset_ner_manatee_non-crossing_only-relevant_training.sentences.txt (1 MB)
    • dataset_ner_manatee_all_all_training.sentences.txt (63 MB)
    • dataset_ner_fuzzy-regex_non-crossing_only-relevant_training.ner_tags.txt (897 kB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_training_automatically_tagged.ner_tags.txt (1 MB)
    • dataset_ner_fuzzy-regex_all_all_testing.sentences.txt (39 MB)
    • dataset_ner_manatee+regests_all_all_validation.sentences.txt (14 MB)
    • dataset_ner_manatee_non-crossing_only-relevant_training_automatically_tagged.ner_tags.txt (679 kB)
    • dataset_ner_manatee_all_only-relevant_training.ner_tags.txt (731 kB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_validation.sentences.txt (725 kB)
    • dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_tagged.sentences.txt (28 kB)
    • dataset_ner_manatee_non-crossing_only-relevant_testing_401-500.sentences.txt (28 kB)
    • dataset_ner_manatee_non-crossing_only-relevant_testing_001-400.sentences.txt (123 kB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_all_training.ner_tags.txt (36 MB)
    • dataset_ner_fuzzy-regex_non-crossing_all_training_automatically_tagged.sentences.txt (109 MB)
    • dataset_ner_manatee_all_only-relevant_validation.ner_tags.txt (164 kB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_validation.ner_tags.txt (249 kB)
    • dataset_ner_fuzzy-regex+regests_all_only-relevant_validation_automatically_tagged.sentences.txt (1 MB)
    • dataset_ner_manatee+regests_non-crossing_only-relevant_training.sentences.txt (2 MB)
    • dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_automatically_tagged.sentences.txt (28 kB)
    • dataset_ner_regests_training_automatically_tagged.sentences.txt (1 MB)
    • dataset_ner_regests_testing_001-400.ner_tags.txt (34 kB)
    • dataset_ner_regests_training.sentences.txt (1 MB)
    • dataset_ner_manatee+regests_all_only-relevant_training_automatically_tagged.ner_tags.txt (1 MB)
    • dataset_ner_fuzzy-regex_all_all_validation.sentences.txt (39 MB)
    • dataset_ner_fuzzy-regex+regests_all_all_validation.ner_tags.txt (12 MB)
    • dataset_ner_manatee+regests_all_only-relevant_validation.ner_tags.txt (213 kB)
    • dataset_ner_fuzzy-regex_all_only-relevant_testing.sentences.txt (928 kB)
    • dataset_ner_manatee+regests_non-crossing_all_validation_automatically_tagged.sentences.txt (8 MB)
    • dataset_ner_manatee_non-crossing_only-relevant_validation_automatically_tagged.ner_tags.txt (147 kB)
    • dataset_ner_manatee+regests_all_all_training.ner_tags.txt (20 MB)
    • dataset_ner_fuzzy-regex_non-crossing_only-relevant_training_automatically_tagged.sentences.txt (2 MB)
    • dataset_ner_fuzzy-regex_non-crossing_only-relevant_testing.ner_tags.txt (214 kB)
    • dataset_ner_fuzzy-regex_non-crossing_all_training.sentences.txt (109 MB)
    • dataset_ner_manatee+regests_all_only-relevant_training.sentences.txt (3 MB)
    • dataset_ner_manatee+regests_non-crossing_all_validation_automatically_tagged.ner_tags.txt (3 MB)
    • dataset_ner_manatee_all_all_testing.ner_tags.txt (4 MB)
    • dataset_ner_manatee+regests_non-crossing_only-relevant_validation.sentences.txt (468 kB)
    • dataset_ner_manatee_all_all_validation_automatically_tagged.ner_tags.txt (5 MB)
    • dataset_ner_manatee+regests_non-crossing_only-relevant_validation_automatically_tagged.ner_tags.txt (198 kB)
    • dataset_ner_manatee+regests_all_only-relevant_training.ner_tags.txt (1 MB)
    • dataset_ner_manatee+regests_all_only-relevant_validation_automatically_tagged.ner_tags.txt (263 kB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_all_validation.ner_tags.txt (8 MB)
    • dataset_ner_fuzzy-regex_non-crossing_all_training_automatically_tagged.ner_tags.txt (45 MB)
    • dataset_ner_fuzzy-regex_non-crossing_only-relevant_testing.sentences.txt (632 kB)
    • dataset_ner_regests_validation_automatically_tagged.ner_tags.txt (50 kB)
    • dataset_ner_regests_validation_automatically_tagged.sentences.txt (128 kB)
    • dataset_ner_manatee+regests_all_all_validation.ner_tags.txt (4 MB)
    • dataset_ner_manatee+regests_non-crossing_only-relevant_training_automatically_tagged.sentences.txt (2 MB)
    • dataset_ner_manatee_all_only-relevant_testing.sentences.txt (498 kB)
    • dataset_ner_manatee_non-crossing_all_validation.sentences.txt (8 MB)
    • dataset_ner_manatee_non-crossing_all_training.sentences.txt (42 MB)
    • dataset_ner_manatee+regests_non-crossing_all_training_automatically_tagged.ner_tags.txt (18 MB)
    • dataset_ner_manatee_all_only-relevant_validation_automatically_tagged.sentences.txt (500 kB)
    • dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_tagged.docx (39 kB)
    • dataset_ner_fuzzy-regex_non-crossing_all_validation.ner_tags.txt (8 MB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_training.ner_tags.txt (1 MB)
    • dataset_ner_fuzzy-regex_all_all_validation_automatically_tagged.sentences.txt (39 MB)
    • dataset_ner_fuzzy-regex_all_only-relevant_validation.ner_tags.txt (299 kB)
    • dataset_ner_manatee_all_all_training_automatically_tagged.sentences.txt (63 MB)
    • dataset_ner_regests_training.ner_tags.txt (411 kB)
    • dataset_ner_manatee+regests_non-crossing_only-relevant_validation_automatically_tagged.sentences.txt (468 kB)
    • dataset_ner_manatee_non-crossing_all_testing.ner_tags.txt (2 MB)
    • dataset_ner_manatee+regests_non-crossing_all_validation.sentences.txt (8 MB)
    • dataset_ner_manatee+regests_all_all_training_automatically_tagged.ner_tags.txt (26 MB)
    • dataset_ner_fuzzy-regex_all_all_training_automatically_tagged.sentences.txt (155 MB)
    • dataset_ner_manatee_non-crossing_only-relevant_validation.ner_tags.txt (114 kB)
    • dataset_ner_manatee+regests_all_only-relevant_validation.sentences.txt (628 kB)
    • dataset_ner_fuzzy-regex_all_all_training_automatically_tagged.ner_tags.txt (63 MB)
    • dataset_ner_manatee_non-crossing_all_training_automatically_tagged.ner_tags.txt (17 MB)
    • dataset_ner_fuzzy-regex_non-crossing_only-relevant_training_automatically_tagged.ner_tags.txt (1 MB)
    • dataset_ner_fuzzy-regex+regests_all_all_validation_automatically_tagged.ner_tags.txt (16 MB)
    • dataset_ner_fuzzy-regex_all_all_training.sentences.txt (156 MB)
    • dataset_ner_manatee_non-crossing_all_training_automatically_tagged.sentences.txt (42 MB)
    • dataset_ner_regests_training_automatically_tagged.ner_tags.txt (424 kB)
    • dataset_ner_manatee_all_all_validation.ner_tags.txt (4 MB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_all_training_automatically_tagged.sentences.txt (110 MB)
    • dataset_ner_fuzzy-regex+regests_all_all_training_automatically_tagged.sentences.txt (156 MB)
    • dataset_ner_fuzzy-regex_non-crossing_all_validation_automatically_tagged.sentences.txt (25 MB)
    • dataset_ner_fuzzy-regex_non-crossing_all_validation.sentences.txt (25 MB)
    • dataset_ner_fuzzy-regex_all_all_validation.ner_tags.txt (12 MB)
    • dataset_ner_fuzzy-regex_all_only-relevant_training_automatically_tagged.ner_tags.txt (1 MB)
    • dataset_ner_fuzzy-regex_non-crossing_only-relevant_training.sentences.txt (2 MB)
    • dataset_ner_manatee_non-crossing_only-relevant_validation_automatically_tagged.sentences.txt (340 kB)
    • dataset_ner_manatee_all_all_validation_automatically_tagged.sentences.txt (14 MB)
    • dataset_ner_manatee_all_all_training.ner_tags.txt (20 MB)
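Many annotation files in the listing above share a name and differ only in the .sentences.txt or .ner_tags.txt suffix. A minimal Python sketch for matching such files by name follows. The helper names are hypothetical. In addition, read_aligned assumes that line N of a tags file holds whitespace-separated tags for the whitespace-separated tokens on line N of the matching sentences file; this page does not confirm that layout, so check it against the extracted files before relying on it.

```python
from collections import defaultdict

SENTENCES_SUFFIX = ".sentences.txt"
TAGS_SUFFIX = ".ner_tags.txt"


def pair_ner_files(names):
    """Group file names into {stem: (sentences_file, tags_file)}.

    Files without a counterpart (and other formats such as .docx) are skipped.
    """
    found = defaultdict(dict)
    for name in names:
        if name.endswith(SENTENCES_SUFFIX):
            found[name[: -len(SENTENCES_SUFFIX)]]["sentences"] = name
        elif name.endswith(TAGS_SUFFIX):
            found[name[: -len(TAGS_SUFFIX)]]["tags"] = name
    return {
        stem: (parts["sentences"], parts["tags"])
        for stem, parts in found.items()
        if len(parts) == 2
    }


def read_aligned(sentences_path, tags_path):
    """Yield (tokens, tags) for each line pair.

    ASSUMPTION: both files are line-aligned and whitespace-separated.
    """
    with open(sentences_path, encoding="utf-8") as sentences, \
            open(tags_path, encoding="utf-8") as tags:
        for sentence_line, tag_line in zip(sentences, tags):
            yield sentence_line.split(), tag_line.split()
```

Comparing len(tokens) with len(tags) for the first few yielded pairs is a quick way to see whether the alignment assumption holds for a given file pair.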