dc.contributor.author Novotný, Vít
dc.contributor.author Luger, Kristýna
dc.contributor.author Štefánik, Michal
dc.contributor.author Vrabcová, Tereza
dc.contributor.author Horák, Aleš
dc.date.accessioned 2023-01-23T20:43:53Z
dc.date.available 2023-01-23T20:43:53Z
dc.date.issued 2022-11-30
dc.identifier.uri http://hdl.handle.net/11234/1-5024
dc.description This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).
dc.language.iso ces
dc.language.iso eng
dc.language.iso deu
dc.language.iso lat
dc.publisher Masaryk University, Brno
dc.relation.isreferencedby https://nlp.fi.muni.cz/projects/ahisto/ner-dataset
dc.relation.replaces http://hdl.handle.net/11234/1-4936
dc.rights Public Domain Dedication (CC Zero)
dc.rights.uri http://creativecommons.org/publicdomain/zero/1.0/
dc.source.uri https://starfos.tacr.cz/en/project/TL03000365
dc.subject NER
dc.subject named entity recognition
dc.subject Medieval
dc.title A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents (2023-01-05)
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
dc.rights.label PUB
has.files yes
branding LINDAT / CLARIAH-CZ
contact.person Vít Novotný witiko@mail.muni.cz Masaryk University, Brno
sponsor TAČR TL03000365 Accessible historical sources. Making medieval written documents available in the form of a contextual database nationalFunds
sponsor Ministry of Education, Youth and Sports of the Czech Republic LM2018101 LINDAT/CLARIAH-CZ: Digital Research Infrastructure for Language Technologies, Arts and Humanities nationalFunds
size.info 205 files
size.info 2.89 GB
files.size 3096251769
files.count 3


 Files in this item

This item is publicly available and licensed under: Public Domain Dedication (CC Zero)
Name: language-modeling-corpus.zip
Size: 633.79 MB
Format: application/zip
Description: Sentences for unsupervised training and validation of language models
MD5: b6ed0a8e7dc263d1a1e635d6d5770f6b
File preview:
    • dataset_mlm_non-crossing_only-relevant_training.txt (6 MB)
    • dataset_mlm_non-crossing_all_validation.txt (64 MB)
    • dataset_mlm_all_only-relevant_validation.txt (719 kB)
    • dataset_mlm_all_only-relevant_training.txt (7 MB)
    • dataset_mlm_non-crossing_all_training.txt (499 MB)
    • dataset_mlm_all_all_validation.txt (78 MB)
    • dataset_mlm_non-crossing_only-relevant_validation.txt (536 kB)
    • dataset_mlm_all_all_training.txt (601 MB)
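Because some of the corpus files listed above are several hundred megabytes, it may be convenient to read them straight from the archive without unpacking it. The following is a minimal sketch, not part of the dataset: the helper name `iter_lines` is hypothetical, and it assumes that the files are UTF-8 text with one sentence per line and that members may sit inside a subdirectory of the archive. Neither assumption is confirmed by this record.

```python
import io
import zipfile
from typing import Iterator


def iter_lines(archive, member_suffix: str) -> Iterator[str]:
    """Yield the non-empty lines of the first archive member whose
    name ends with `member_suffix`.

    `archive` is a path or a binary file-like object.
    Assumption (not confirmed by the record): UTF-8, one sentence per line.
    """
    with zipfile.ZipFile(archive) as zf:
        matches = [n for n in zf.namelist() if n.endswith(member_suffix)]
        if not matches:
            raise FileNotFoundError(member_suffix)
        # Decode the member lazily so large files are never fully in memory.
        with zf.open(matches[0]) as raw:
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                line = line.strip()
                if line:
                    yield line
```

For example, `iter_lines("language-modeling-corpus.zip", "dataset_mlm_all_only-relevant_validation.txt")` would iterate over the smallest validation file in the listing, provided the archive has been downloaded to the working directory.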
Name: named-entity-recognition-annotations-small.zip
Size: 978.29 MB
Format: application/zip
Description: Sentences and NER tags that we used for the supervised training, validation, and testing of intermediate language models
MD5: 92d80ec8d6e66263295797b1e81bd60d
File preview:
    • dataset_ner_manatee_non-crossing_only-relevant_testing_001-400.ner_tags.txt (41 kB)
    • dataset_ner_manatee+regests_all_only-relevant_validation_automatically_tagged.sentences.txt (628 kB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_all_validation_automatically_tagged.ner_tags.txt (10 MB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_training_automatically_tagged.sentences.txt (3 MB)
    • dataset_ner_regests_testing.sentences.txt (186 kB)
    • dataset_ner_fuzzy-regex+regests_all_all_training.ner_tags.txt (50 MB)
    • dataset_ner_manatee_non-crossing_all_validation.ner_tags.txt (2 MB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_validation_automatically_tagged.ner_tags.txt (317 kB)
    • dataset_ner_fuzzy-regex_non-crossing_all_validation_automatically_tagged.ner_tags.txt (10 MB)
    • dataset_ner_manatee+regests_non-crossing_all_training.sentences.txt (43 MB)
    • dataset_ner_manatee_non-crossing_only-relevant_training_automatically_tagged.sentences.txt (1 MB)
    • dataset_ner_fuzzy-regex+regests_all_only-relevant_training.sentences.txt (4 MB)
    • dataset_ner_manatee+regests_all_only-relevant_training_automatically_tagged.sentences.txt (3 MB)
    • dataset_ner_manatee_non-crossing_only-relevant_training.ner_tags.txt (546 kB)
    • dataset_ner_manatee_non-crossing_all_validation_automatically_tagged.ner_tags.txt (3 MB)
    • dataset_ner_manatee+regests_all_all_training_automatically_tagged.sentences.txt (64 MB)
    • dataset_ner_fuzzy-regex+regests_all_all_training_automatically_tagged.ner_tags.txt (63 MB)
    • dataset_ner_manatee+regests_all_all_training.sentences.txt (64 MB)
    • dataset_ner_manatee_all_all_validation.sentences.txt (14 MB)
    • dataset_ner_fuzzy-regex_all_only-relevant_validation_automatically_tagged.sentences.txt (908 kB)
    • dataset_ner_manatee+regests_non-crossing_only-relevant_training.ner_tags.txt (956 kB)
    • dataset_ner_fuzzy-regex_all_only-relevant_validation_automatically_tagged.ner_tags.txt (394 kB)
    • dataset_ner_fuzzy-regex+regests_all_only-relevant_validation.ner_tags.txt (347 kB)
    • dataset_ner_fuzzy-regex+regests_all_all_training.sentences.txt (156 MB)
    • dataset_ner_regests_testing_001-400.sentences.txt (91 kB)
    • dataset_ner_manatee+regests_non-crossing_only-relevant_training_automatically_tagged.ner_tags.txt (1 MB)
    • dataset_ner_fuzzy-regex_all_all_testing.ner_tags.txt (12 MB)
    • dataset_ner_manatee_non-crossing_only-relevant_testing.ner_tags.txt (114 kB)
    • dataset_ner_fuzzy-regex_non-crossing_all_testing.ner_tags.txt (8 MB)
    • dataset_ner_regests_testing.ner_tags.txt (69 kB)
    • dataset_ner_manatee_non-crossing_all_validation_automatically_tagged.sentences.txt (8 MB)
    • dataset_ner_manatee_non-crossing_all_training.ner_tags.txt (13 MB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_training.sentences.txt (3 MB)
    • dataset_ner_fuzzy-regex+regests_all_all_validation_automatically_tagged.sentences.txt (39 MB)
    • dataset_ner_manatee_non-crossing_only-relevant_testing_401-500.ner_tags.txt (9 kB)
    • dataset_ner_manatee+regests_all_all_validation_automatically_tagged.ner_tags.txt (5 MB)
    • dataset_ner_fuzzy-regex_all_only-relevant_training.sentences.txt (3 MB)
    • dataset_ner_manatee_all_only-relevant_testing.ner_tags.txt (164 kB)
    • dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_automatically_tagged.docx (58 kB)
    • dataset_ner_fuzzy-regex_non-crossing_all_training.ner_tags.txt (36 MB)
    • dataset_ner_fuzzy-regex+regests_all_only-relevant_training_automatically_tagged.ner_tags.txt (1 MB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_validation_automatically_tagged.sentences.txt (725 kB)
    • dataset_ner_fuzzy-regex+regests_all_only-relevant_training.ner_tags.txt (1 MB)
    • dataset_ner_manatee_all_all_training_automatically_tagged.ner_tags.txt (26 MB)
    • dataset_ner_manatee+regests_all_all_validation_automatically_tagged.sentences.txt (14 MB)
    • dataset_ner_manatee_all_only-relevant_validation_automatically_tagged.ner_tags.txt (212 kB)
    • dataset_ner_fuzzy-regex_all_only-relevant_testing.ner_tags.txt (308 kB)
    • dataset_ner_manatee+regests_non-crossing_all_training.ner_tags.txt (13 MB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_all_training_automatically_tagged.ner_tags.txt (46 MB)
    • dataset_ner_manatee+regests_non-crossing_all_training_automatically_tagged.sentences.txt (43 MB)
    • dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_tagged.ner_tags.txt (12 kB)
    • dataset_ner_manatee+regests_non-crossing_all_validation.ner_tags.txt (2 MB)
    • dataset_ner_regests_validation.ner_tags.txt (49 kB)
    • dataset_ner_manatee_all_only-relevant_training_automatically_tagged.ner_tags.txt (899 kB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_all_validation_automatically_tagged.sentences.txt (25 MB)
    • dataset_ner_fuzzy-regex_all_only-relevant_validation.sentences.txt (908 kB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_all_validation.sentences.txt (25 MB)
    • dataset_ner_manatee_all_all_testing.sentences.txt (14 MB)
    • dataset_ner_fuzzy-regex_all_all_training.ner_tags.txt (51 MB)
    • dataset_ner_fuzzy-regex_all_all_validation_automatically_tagged.ner_tags.txt (16 MB)
    • dataset_ner_fuzzy-regex_non-crossing_only-relevant_validation_automatically_tagged.ner_tags.txt (266 kB)
    • dataset_ner_manatee_all_only-relevant_validation.sentences.txt (500 kB)
    • dataset_ner_manatee_all_only-relevant_training.sentences.txt (2 MB)
    • dataset_ner_manatee_non-crossing_only-relevant_testing.sentences.txt (343 kB)
    • dataset_ner_fuzzy-regex_non-crossing_only-relevant_validation_automatically_tagged.sentences.txt (597 kB)
    • dataset_ner_fuzzy-regex_non-crossing_only-relevant_validation.ner_tags.txt (201 kB)
    • dataset_ner_fuzzy-regex+regests_all_all_validation.sentences.txt (39 MB)
    • dataset_ner_fuzzy-regex+regests_all_only-relevant_training_automatically_tagged.sentences.txt (4 MB)
    • dataset_ner_manatee_non-crossing_all_testing.sentences.txt (8 MB)
    • dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_automatically_tagged.ner_tags.txt (11 kB)
    • dataset_ner_fuzzy-regex_all_only-relevant_training_automatically_tagged.sentences.txt (3 MB)
    • dataset_ner_manatee_non-crossing_only-relevant_validation.sentences.txt (340 kB)
    • dataset_ner_fuzzy-regex_non-crossing_all_testing.sentences.txt (25 MB)
    • dataset_ner_fuzzy-regex+regests_all_only-relevant_validation_automatically_tagged.ner_tags.txt (445 kB)
    • dataset_ner_manatee+regests_non-crossing_only-relevant_validation.ner_tags.txt (162 kB)
    • dataset_ner_fuzzy-regex_all_only-relevant_training.ner_tags.txt (1 MB)
    • dataset_ner_manatee_all_only-relevant_training_automatically_tagged.sentences.txt (2 MB)
    • dataset_ner_regests_validation.sentences.txt (128 kB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_all_training.sentences.txt (110 MB)
    • dataset_ner_fuzzy-regex_non-crossing_only-relevant_validation.sentences.txt (598 kB)
    • dataset_ner_fuzzy-regex+regests_all_only-relevant_validation.sentences.txt (1 MB)
    • dataset_ner_manatee_non-crossing_only-relevant_training.sentences.txt (1 MB)
    • dataset_ner_manatee_all_all_training.sentences.txt (63 MB)
    • dataset_ner_fuzzy-regex_non-crossing_only-relevant_training.ner_tags.txt (897 kB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_training_automatically_tagged.ner_tags.txt (1 MB)
    • dataset_ner_fuzzy-regex_all_all_testing.sentences.txt (39 MB)
    • dataset_ner_manatee+regests_all_all_validation.sentences.txt (14 MB)
    • dataset_ner_manatee_non-crossing_only-relevant_training_automatically_tagged.ner_tags.txt (679 kB)
    • dataset_ner_manatee_all_only-relevant_training.ner_tags.txt (731 kB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_validation.sentences.txt (725 kB)
    • dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_tagged.sentences.txt (28 kB)
    • dataset_ner_manatee_non-crossing_only-relevant_testing_401-500.sentences.txt (28 kB)
    • dataset_ner_manatee_non-crossing_only-relevant_testing_001-400.sentences.txt (123 kB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_all_training.ner_tags.txt (36 MB)
    • dataset_ner_fuzzy-regex_non-crossing_all_training_automatically_tagged.sentences.txt (109 MB)
    • dataset_ner_manatee_all_only-relevant_validation.ner_tags.txt (164 kB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_validation.ner_tags.txt (249 kB)
    • dataset_ner_fuzzy-regex+regests_all_only-relevant_validation_automatically_tagged.sentences.txt (1 MB)
    • dataset_ner_manatee+regests_non-crossing_only-relevant_training.sentences.txt (2 MB)
    • dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_automatically_tagged.sentences.txt (28 kB)
    • dataset_ner_regests_training.sentences.txt (1 MB)
    • dataset_ner_regests_training_automatically_tagged.sentences.txt (1 MB)
    • dataset_ner_regests_testing_001-400.ner_tags.txt (34 kB)
    • dataset_ner_manatee+regests_all_only-relevant_training_automatically_tagged.ner_tags.txt (1 MB)
    • dataset_ner_fuzzy-regex_all_all_validation.sentences.txt (39 MB)
    • dataset_ner_fuzzy-regex+regests_all_all_validation.ner_tags.txt (12 MB)
    • dataset_ner_manatee+regests_all_only-relevant_validation.ner_tags.txt (213 kB)
    • dataset_ner_manatee+regests_non-crossing_all_validation_automatically_tagged.sentences.txt (8 MB)
    • dataset_ner_fuzzy-regex_all_only-relevant_testing.sentences.txt (928 kB)
    • dataset_ner_manatee_non-crossing_only-relevant_validation_automatically_tagged.ner_tags.txt (147 kB)
    • dataset_ner_manatee+regests_all_all_training.ner_tags.txt (20 MB)
    • dataset_ner_fuzzy-regex_non-crossing_only-relevant_training_automatically_tagged.sentences.txt (2 MB)
    • dataset_ner_fuzzy-regex_non-crossing_only-relevant_testing.ner_tags.txt (214 kB)
    • dataset_ner_fuzzy-regex_non-crossing_all_training.sentences.txt (109 MB)
    • dataset_ner_manatee+regests_all_only-relevant_training.sentences.txt (3 MB)
    • dataset_ner_manatee+regests_non-crossing_all_validation_automatically_tagged.ner_tags.txt (3 MB)
    • dataset_ner_manatee_all_all_testing.ner_tags.txt (4 MB)
    • dataset_ner_manatee+regests_non-crossing_only-relevant_validation.sentences.txt (468 kB)
    • dataset_ner_manatee_all_all_validation_automatically_tagged.ner_tags.txt (5 MB)
    • dataset_ner_manatee+regests_non-crossing_only-relevant_validation_automatically_tagged.ner_tags.txt (198 kB)
    • dataset_ner_manatee+regests_all_only-relevant_training.ner_tags.txt (1 MB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_all_validation.ner_tags.txt (8 MB)
    • dataset_ner_manatee+regests_all_only-relevant_validation_automatically_tagged.ner_tags.txt (263 kB)
    • dataset_ner_fuzzy-regex_non-crossing_all_training_automatically_tagged.ner_tags.txt (45 MB)
    • dataset_ner_fuzzy-regex_non-crossing_only-relevant_testing.sentences.txt (632 kB)
    • dataset_ner_regests_validation_automatically_tagged.sentences.txt (128 kB)
    • dataset_ner_regests_validation_automatically_tagged.ner_tags.txt (50 kB)
    • dataset_ner_manatee+regests_all_all_validation.ner_tags.txt (4 MB)
    • dataset_ner_manatee+regests_non-crossing_only-relevant_training_automatically_tagged.sentences.txt (2 MB)
    • dataset_ner_manatee_all_only-relevant_testing.sentences.txt (498 kB)
    • dataset_ner_manatee_non-crossing_all_validation.sentences.txt (8 MB)
    • dataset_ner_manatee_non-crossing_all_training.sentences.txt (42 MB)
    • dataset_ner_manatee+regests_non-crossing_all_training_automatically_tagged.ner_tags.txt (18 MB)
    • dataset_ner_manatee_all_only-relevant_validation_automatically_tagged.sentences.txt (500 kB)
    • dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_tagged.docx (39 kB)
    • dataset_ner_fuzzy-regex_non-crossing_all_validation.ner_tags.txt (8 MB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_training.ner_tags.txt (1 MB)
    • dataset_ner_fuzzy-regex_all_all_validation_automatically_tagged.sentences.txt (39 MB)
    • dataset_ner_fuzzy-regex_all_only-relevant_validation.ner_tags.txt (299 kB)
    • dataset_ner_manatee_all_all_training_automatically_tagged.sentences.txt (63 MB)
    • dataset_ner_regests_training.ner_tags.txt (411 kB)
    • dataset_ner_manatee+regests_non-crossing_only-relevant_validation_automatically_tagged.sentences.txt (468 kB)
    • dataset_ner_manatee_non-crossing_all_testing.ner_tags.txt (2 MB)
    • dataset_ner_manatee+regests_non-crossing_all_validation.sentences.txt (8 MB)
    • dataset_ner_manatee+regests_all_all_training_automatically_tagged.ner_tags.txt (26 MB)
    • dataset_ner_fuzzy-regex_all_all_training_automatically_tagged.sentences.txt (155 MB)
    • dataset_ner_manatee_non-crossing_only-relevant_validation.ner_tags.txt (114 kB)
    • dataset_ner_manatee+regests_all_only-relevant_validation.sentences.txt (628 kB)
    • dataset_ner_fuzzy-regex_all_all_training_automatically_tagged.ner_tags.txt (63 MB)
    • dataset_ner_manatee_non-crossing_all_training_automatically_tagged.ner_tags.txt (17 MB)
    • dataset_ner_fuzzy-regex_non-crossing_only-relevant_training_automatically_tagged.ner_tags.txt (1 MB)
    • dataset_ner_fuzzy-regex+regests_all_all_validation_automatically_tagged.ner_tags.txt (16 MB)
    • dataset_ner_fuzzy-regex_all_all_training.sentences.txt (156 MB)
    • dataset_ner_regests_training_automatically_tagged.ner_tags.txt (424 kB)
    • dataset_ner_manatee_non-crossing_all_training_automatically_tagged.sentences.txt (42 MB)
    • dataset_ner_manatee_all_all_validation.ner_tags.txt (4 MB)
    • dataset_ner_fuzzy-regex+regests_non-crossing_all_training_automatically_tagged.sentences.txt (110 MB)
    • dataset_ner_fuzzy-regex+regests_all_all_training_automatically_tagged.sentences.txt (156 MB)
    • dataset_ner_fuzzy-regex_non-crossing_all_validation_automatically_tagged.sentences.txt (25 MB)
    • dataset_ner_fuzzy-regex_non-crossing_all_validation.sentences.txt (25 MB)
    • dataset_ner_fuzzy-regex_all_all_validation.ner_tags.txt (12 MB)
    • dataset_ner_fuzzy-regex_all_only-relevant_training_automatically_tagged.ner_tags.txt (1 MB)
    • dataset_ner_fuzzy-regex_non-crossing_only-relevant_training.sentences.txt (2 MB)
    • dataset_ner_manatee_non-crossing_only-relevant_validation_automatically_tagged.sentences.txt (340 kB)
    • dataset_ner_manatee_all_all_validation_automatically_tagged.sentences.txt (14 MB)
    • dataset_ner_manatee_all_all_training.ner_tags.txt (20 MB)
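The listing above mixes `.sentences.txt` and `.ner_tags.txt` files that share a common name stem, plus a few `.docx` files. A minimal sketch of how the two kinds of file could be matched up by stem is shown below. The function name `pair_members` is hypothetical, and the assumption that files with the same stem belong together is inferred from the naming pattern only, not stated in this record.

```python
from typing import Dict, Iterable, List, Tuple

# The two suffixes that appear in the listing for text files.
SUFFIXES = (".sentences.txt", ".ner_tags.txt")


def pair_members(
    names: Iterable[str],
) -> Tuple[Dict[str, Dict[str, str]], List[str]]:
    """Group file names by stem.

    Returns (complete, incomplete): `complete` maps each stem that has
    both a sentences file and a tags file to those names; `incomplete`
    lists stems for which only one of the two was found. Names with
    other extensions (for example .docx) are ignored.
    """
    groups: Dict[str, Dict[str, str]] = {}
    for name in names:
        for suffix in SUFFIXES:
            if name.endswith(suffix):
                stem = name[: -len(suffix)]
                kind = suffix.split(".")[1]  # "sentences" or "ner_tags"
                groups.setdefault(stem, {})[kind] = name
                break
    complete = {s: g for s, g in groups.items() if len(g) == 2}
    incomplete = sorted(s for s, g in groups.items() if len(g) < 2)
    return complete, incomplete
```

The input can be any list of names, for instance the result of `zipfile.ZipFile(path).namelist()` on a downloaded archive. Stems reported as incomplete are worth checking by hand before training on them.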
Name: named-entity-recognition-annotations-large.zip
Size: 1.31 GB
Format: application/zip
Description: Sentences and NER tags for supervised training, validation, and testing of language models
MD5: 4bf38b89ed8948d3fa355b6e1e55d6de
File preview:
    • dataset_mlm_non-crossing_only-relevant_validation_automatically_tagged_004.sentences.txt (534 kB)
    • dataset_mlm_all_only-relevant_training_automatically_tagged_004.sentences.txt (7 MB)
    • dataset_mlm_non-crossing_all_training_automatically_tagged_004.sentences.txt (498 MB)
    • dataset_mlm_all_all_validation_automatically_tagged_004.sentences.txt (77 MB)
    • dataset_mlm_non-crossing_only-relevant_training_automatically_tagged_007.sentences.txt (6 MB)
    • dataset_mlm_all_all_training_automatically_tagged_004.sentences.txt (599 MB)
    • dataset_mlm_all_only-relevant_validation_automatically_tagged_004.ner_tags.txt (265 kB)
    • dataset_mlm_all_only-relevant_validation_automatically_tagged_004.sentences.txt (717 kB)
    • dataset_mlm_all_only-relevant_validation_automatically_tagged_007.ner_tags.txt (248 kB)
    • dataset_mlm_all_all_validation_automatically_tagged_007.ner_tags.txt (28 MB)
    • dataset_mlm_non-crossing_only-relevant_training_automatically_tagged_004.ner_tags.txt (2 MB)
    • dataset_mlm_non-crossing_only-relevant_training_automatically_tagged_007.ner_tags.txt (2 MB)
    • dataset_mlm_non-crossing_all_validation_automatically_tagged_007.sentences.txt (64 MB)
    • dataset_mlm_non-crossing_all_training_automatically_tagged_004.ner_tags.txt (203 MB)
    • dataset_mlm_non-crossing_all_training_automatically_tagged_007.ner_tags.txt (184 MB)
    • dataset_mlm_all_only-relevant_training_automatically_tagged_004.ner_tags.txt (3 MB)
    • dataset_mlm_non-crossing_only-relevant_training_automatically_tagged_004.sentences.txt (6 MB)
    • dataset_mlm_all_only-relevant_training_automatically_tagged_007.ner_tags.txt (3 MB)
    • dataset_mlm_all_all_training_automatically_tagged_004.ner_tags.txt (241 MB)
    • dataset_mlm_all_all_training_automatically_tagged_007.ner_tags.txt (220 MB)
    • dataset_mlm_non-crossing_only-relevant_validation_automatically_tagged_004.ner_tags.txt (202 kB)
    • dataset_mlm_non-crossing_only-relevant_validation_automatically_tagged_007.ner_tags.txt (188 kB)
    • dataset_mlm_non-crossing_all_validation_automatically_tagged_004.sentences.txt (64 MB)
    • dataset_mlm_non-crossing_only-relevant_validation_automatically_tagged_007.sentences.txt (534 kB)
    • dataset_mlm_all_only-relevant_training_automatically_tagged_007.sentences.txt (7 MB)
    • dataset_mlm_non-crossing_all_training_automatically_tagged_007.sentences.txt (498 MB)
    • dataset_mlm_all_all_validation_automatically_tagged_004.ner_tags.txt (30 MB)
    • dataset_mlm_all_all_validation_automatically_tagged_007.sentences.txt (77 MB)
    • dataset_mlm_all_all_training_automatically_tagged_007.sentences.txt (599 MB)
    • dataset_mlm_all_only-relevant_validation_automatically_tagged_007.sentences.txt (717 kB)
    • dataset_mlm_non-crossing_all_validation_automatically_tagged_004.ner_tags.txt (25 MB)
    • dataset_mlm_non-crossing_all_validation_automatically_tagged_007.ner_tags.txt (23 MB)
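
Each of the three archives in this item is listed with an MD5 checksum, so a download can be checked for corruption before unpacking. The sketch below uses Python's standard `hashlib` module; the checksums are copied from this record, while the function names and the idea of looking up archives by file name are illustrative assumptions.

```python
import hashlib

# Checksums as listed in this record, keyed by archive file name.
EXPECTED_MD5 = {
    "language-modeling-corpus.zip": "b6ed0a8e7dc263d1a1e635d6d5770f6b",
    "named-entity-recognition-annotations-small.zip": "92d80ec8d6e66263295797b1e81bd60d",
    "named-entity-recognition-annotations-large.zip": "4bf38b89ed8948d3fa355b6e1e55d6de",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks
    so that multi-gigabyte archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: str, archive_name: str) -> bool:
    """Compare a local file against the checksum listed for `archive_name`."""
    return md5_of_file(path) == EXPECTED_MD5[archive_name]
```

A call such as `is_intact("language-modeling-corpus.zip", "language-modeling-corpus.zip")` returns `True` only if the local file matches the listed checksum. MD5 is adequate for detecting transfer errors but is not a guarantee against deliberate tampering.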
