dc.contributor.author | Novotný, Vít |
dc.contributor.author | Luger, Kristýna |
dc.contributor.author | Štefánik, Michal |
dc.contributor.author | Vrabcová, Tereza |
dc.contributor.author | Horák, Aleš |
dc.date.accessioned | 2023-01-23T20:43:53Z |
dc.date.available | 2023-01-23T20:43:53Z |
dc.date.issued | 2022-11-30 |
dc.identifier.uri | http://hdl.handle.net/11234/1-5024 |
dc.description | This is an open dataset of sentences from 19th- and 20th-century letterpress reprints of documents from the Hussite era. The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER). |
dc.language.iso | ces |
dc.language.iso | eng |
dc.language.iso | deu |
dc.language.iso | lat |
dc.publisher | Masaryk University, Brno |
dc.relation.isreferencedby | https://nlp.fi.muni.cz/projects/ahisto/ner-dataset |
dc.relation.replaces | http://hdl.handle.net/11234/1-4936 |
dc.rights | Public Domain Dedication (CC Zero) |
dc.rights.uri | http://creativecommons.org/publicdomain/zero/1.0/ |
dc.source.uri | https://starfos.tacr.cz/en/project/TL03000365 |
dc.subject | NER |
dc.subject | named entity recognition |
dc.subject | Medieval |
dc.title | A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents (2023-01-05) |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
dc.rights.label | PUB |
has.files | yes |
branding | LINDAT / CLARIAH-CZ |
contact.person | Vít Novotný witiko@mail.muni.cz Masaryk University, Brno |
sponsor | TAČR TL03000365 Accessible historical sources. Making medieval written documents available in the form of a contextual database nationalFunds |
sponsor | Ministerstvo školství, mládeže a tělovýchovy České republiky LM2018101 LINDAT/CLARIAH-CZ: Digitální výzkumná infrastruktura pro jazykové technologie, umění a humanitní vědy nationalFunds |
size.info | 205 files |
size.info | 2.89 GB |
files.size | 3096251769 |
files.count | 3 |
Files in this item
- Name
- language-modeling-corpus.zip
- Size
- 633.79 MB
- Format
- application/zip
- Description
- Sentences for unsupervised training and validation of language models
- MD5
- b6ed0a8e7dc263d1a1e635d6d5770f6b
- dataset_mlm_non-crossing_only-relevant_training.txt (6 MB)
- dataset_mlm_non-crossing_all_validation.txt (64 MB)
- dataset_mlm_all_only-relevant_validation.txt (719 kB)
- dataset_mlm_all_only-relevant_training.txt (7 MB)
- dataset_mlm_non-crossing_all_training.txt (499 MB)
- dataset_mlm_all_all_validation.txt (78 MB)
- dataset_mlm_non-crossing_only-relevant_validation.txt (536 kB)
- dataset_mlm_all_all_training.txt (601 MB)
- Name
- named-entity-recognition-annotations-small.zip
- Size
- 978.29 MB
- Format
- application/zip
- Description
- Sentences and NER tags that we used for the supervised training, validation, and testing of intermediate language models
- MD5
- 92d80ec8d6e66263295797b1e81bd60d
- dataset_ner_manatee_non-crossing_only-relevant_testing_001-400.ner_tags.txt (41 kB)
- dataset_ner_manatee+regests_all_only-relevant_validation_automatically_tagged.sentences.txt (628 kB)
- dataset_ner_fuzzy-regex+regests_non-crossing_all_validation_automatically_tagged.ner_tags.txt (10 MB)
- dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_training_automatically_tagged.sentences.txt (3 MB)
- dataset_ner_regests_testing.sentences.txt (186 kB)
- dataset_ner_fuzzy-regex+regests_all_all_training.ner_tags.txt (50 MB)
- dataset_ner_manatee_non-crossing_all_validation.ner_tags.txt (2 MB)
- dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_validation_automatically_tagged.ner_tags.txt (317 kB)
- dataset_ner_fuzzy-regex_non-crossing_all_validation_automatically_tagged.ner_tags.txt (10 MB)
- dataset_ner_manatee+regests_non-crossing_all_training.sentences.txt (43 MB)
- dataset_ner_manatee_non-crossing_only-relevant_training_automatically_tagged.sentences.txt (1 MB)
- dataset_ner_fuzzy-regex+regests_all_only-relevant_training.sentences.txt (4 MB)
- dataset_ner_manatee+regests_all_only-relevant_training_automatically_tagged.sentences.txt (3 MB)
- dataset_ner_manatee_non-crossing_only-relevant_training.ner_tags.txt (546 kB)
- dataset_ner_manatee_non-crossing_all_validation_automatically_tagged.ner_tags.txt (3 MB)
- dataset_ner_manatee+regests_all_all_training_automatically_tagged.sentences.txt (64 MB)
- dataset_ner_fuzzy-regex+regests_all_all_training_automatically_tagged.ner_tags.txt (63 MB)
- dataset_ner_manatee+regests_all_all_training.sentences.txt (64 MB)
- dataset_ner_manatee_all_all_validation.sentences.txt (14 MB)
- dataset_ner_fuzzy-regex_all_only-relevant_validation_automatically_tagged.sentences.txt (908 kB)
- dataset_ner_manatee+regests_non-crossing_only-relevant_training.ner_tags.txt (956 kB)
- dataset_ner_fuzzy-regex_all_only-relevant_validation_automatically_tagged.ner_tags.txt (394 kB)
- dataset_ner_fuzzy-regex+regests_all_only-relevant_validation.ner_tags.txt (347 kB)
- dataset_ner_fuzzy-regex+regests_all_all_training.sentences.txt (156 MB)
- dataset_ner_regests_testing_001-400.sentences.txt (91 kB)
- dataset_ner_manatee+regests_non-crossing_only-relevant_training_automatically_tagged.ner_tags.txt (1 MB)
- dataset_ner_fuzzy-regex_all_all_testing.ner_tags.txt (12 MB)
- dataset_ner_manatee_non-crossing_only-relevant_testing.ner_tags.txt (114 kB)
- dataset_ner_fuzzy-regex_non-crossing_all_testing.ner_tags.txt (8 MB)
- dataset_ner_regests_testing.ner_tags.txt (69 kB)
- dataset_ner_manatee_non-crossing_all_validation_automatically_tagged.sentences.txt (8 MB)
- dataset_ner_manatee_non-crossing_all_training.ner_tags.txt (13 MB)
- dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_training.sentences.txt (3 MB)
- dataset_ner_fuzzy-regex+regests_all_all_validation_automatically_tagged.sentences.txt (39 MB)
- dataset_ner_manatee_non-crossing_only-relevant_testing_401-500.ner_tags.txt (9 kB)
- dataset_ner_manatee+regests_all_all_validation_automatically_tagged.ner_tags.txt (5 MB)
- dataset_ner_fuzzy-regex_all_only-relevant_training.sentences.txt (3 MB)
- dataset_ner_manatee_all_only-relevant_testing.ner_tags.txt (164 kB)
- dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_automatically_tagged.docx (58 kB)
- dataset_ner_fuzzy-regex_non-crossing_all_training.ner_tags.txt (36 MB)
- dataset_ner_fuzzy-regex+regests_all_only-relevant_training_automatically_tagged.ner_tags.txt (1 MB)
- dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_validation_automatically_tagged.sentences.txt (725 kB)
- dataset_ner_fuzzy-regex+regests_all_only-relevant_training.ner_tags.txt (1 MB)
- dataset_ner_manatee_all_all_training_automatically_tagged.ner_tags.txt (26 MB)
- dataset_ner_manatee+regests_all_all_validation_automatically_tagged.sentences.txt (14 MB)
- dataset_ner_manatee_all_only-relevant_validation_automatically_tagged.ner_tags.txt (212 kB)
- dataset_ner_fuzzy-regex_all_only-relevant_testing.ner_tags.txt (308 kB)
- dataset_ner_manatee+regests_non-crossing_all_training.ner_tags.txt (13 MB)
- dataset_ner_fuzzy-regex+regests_non-crossing_all_training_automatically_tagged.ner_tags.txt (46 MB)
- dataset_ner_manatee+regests_non-crossing_all_training_automatically_tagged.sentences.txt (43 MB)
- dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_tagged.ner_tags.txt (12 kB)
- dataset_ner_manatee+regests_non-crossing_all_validation.ner_tags.txt (2 MB)
- dataset_ner_regests_validation.ner_tags.txt (49 kB)
- dataset_ner_manatee_all_only-relevant_training_automatically_tagged.ner_tags.txt (899 kB)
- dataset_ner_fuzzy-regex+regests_non-crossing_all_validation_automatically_tagged.sentences.txt (25 MB)
- dataset_ner_fuzzy-regex_all_only-relevant_validation.sentences.txt (908 kB)
- dataset_ner_fuzzy-regex+regests_non-crossing_all_validation.sentences.txt (25 MB)
- dataset_ner_manatee_all_all_testing.sentences.txt (14 MB)
- dataset_ner_fuzzy-regex_all_all_training.ner_tags.txt (51 MB)
- dataset_ner_fuzzy-regex_all_all_validation_automatically_tagged.ner_tags.txt (16 MB)
- dataset_ner_fuzzy-regex_non-crossing_only-relevant_validation_automatically_tagged.ner_tags.txt (266 kB)
- dataset_ner_manatee_all_only-relevant_validation.sentences.txt (500 kB)
- dataset_ner_manatee_all_only-relevant_training.sentences.txt (2 MB)
- dataset_ner_manatee_non-crossing_only-relevant_testing.sentences.txt (343 kB)
- dataset_ner_fuzzy-regex_non-crossing_only-relevant_validation_automatically_tagged.sentences.txt (597 kB)
- dataset_ner_fuzzy-regex_non-crossing_only-relevant_validation.ner_tags.txt (201 kB)
- dataset_ner_fuzzy-regex+regests_all_all_validation.sentences.txt (39 MB)
- dataset_ner_fuzzy-regex+regests_all_only-relevant_training_automatically_tagged.sentences.txt (4 MB)
- dataset_ner_manatee_non-crossing_all_testing.sentences.txt (8 MB)
- dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_automatically_tagged.ner_tags.txt (11 kB)
- dataset_ner_fuzzy-regex_all_only-relevant_training_automatically_tagged.sentences.txt (3 MB)
- dataset_ner_manatee_non-crossing_only-relevant_validation.sentences.txt (340 kB)
- dataset_ner_fuzzy-regex_non-crossing_all_testing.sentences.txt (25 MB)
- dataset_ner_fuzzy-regex+regests_all_only-relevant_validation_automatically_tagged.ner_tags.txt (445 kB)
- dataset_ner_manatee+regests_non-crossing_only-relevant_validation.ner_tags.txt (162 kB)
- dataset_ner_fuzzy-regex_all_only-relevant_training.ner_tags.txt (1 MB)
- dataset_ner_manatee_all_only-relevant_training_automatically_tagged.sentences.txt (2 MB)
- dataset_ner_regests_validation.sentences.txt (128 kB)
- dataset_ner_fuzzy-regex+regests_non-crossing_all_training.sentences.txt (110 MB)
- dataset_ner_fuzzy-regex_non-crossing_only-relevant_validation.sentences.txt (598 kB)
- dataset_ner_fuzzy-regex+regests_all_only-relevant_validation.sentences.txt (1 MB)
- dataset_ner_manatee_non-crossing_only-relevant_training.sentences.txt (1 MB)
- dataset_ner_manatee_all_all_training.sentences.txt (63 MB)
- dataset_ner_fuzzy-regex_non-crossing_only-relevant_training.ner_tags.txt (897 kB)
- dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_training_automatically_tagged.ner_tags.txt (1 MB)
- dataset_ner_fuzzy-regex_all_all_testing.sentences.txt (39 MB)
- dataset_ner_manatee+regests_all_all_validation.sentences.txt (14 MB)
- dataset_ner_manatee_non-crossing_only-relevant_training_automatically_tagged.ner_tags.txt (679 kB)
- dataset_ner_manatee_all_only-relevant_training.ner_tags.txt (731 kB)
- dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_validation.sentences.txt (725 kB)
- dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_tagged.sentences.txt (28 kB)
- dataset_ner_manatee_non-crossing_only-relevant_testing_401-500.sentences.txt (28 kB)
- dataset_ner_manatee_non-crossing_only-relevant_testing_001-400.sentences.txt (123 kB)
- dataset_ner_fuzzy-regex+regests_non-crossing_all_training.ner_tags.txt (36 MB)
- dataset_ner_fuzzy-regex_non-crossing_all_training_automatically_tagged.sentences.txt (109 MB)
- dataset_ner_manatee_all_only-relevant_validation.ner_tags.txt (164 kB)
- dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_validation.ner_tags.txt (249 kB)
- dataset_ner_fuzzy-regex+regests_all_only-relevant_validation_automatically_tagged.sentences.txt (1 MB)
- dataset_ner_manatee+regests_non-crossing_only-relevant_training.sentences.txt (2 MB)
- dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_automatically_tagged.sentences.txt (28 kB)
- dataset_ner_regests_training.sentences.txt (1 MB)
- dataset_ner_regests_training_automatically_tagged.sentences.txt (1 MB)
- dataset_ner_regests_testing_001-400.ner_tags.txt (34 kB)
- dataset_ner_manatee+regests_all_only-relevant_training_automatically_tagged.ner_tags.txt (1 MB)
- dataset_ner_fuzzy-regex_all_all_validation.sentences.txt (39 MB)
- dataset_ner_fuzzy-regex+regests_all_all_validation.ner_tags.txt (12 MB)
- dataset_ner_manatee+regests_all_only-relevant_validation.ner_tags.txt (213 kB)
- dataset_ner_manatee+regests_non-crossing_all_validation_automatically_tagged.sentences.txt (8 MB)
- dataset_ner_fuzzy-regex_all_only-relevant_testing.sentences.txt (928 kB)
- dataset_ner_manatee_non-crossing_only-relevant_validation_automatically_tagged.ner_tags.txt (147 kB)
- dataset_ner_manatee+regests_all_all_training.ner_tags.txt (20 MB)
- dataset_ner_fuzzy-regex_non-crossing_only-relevant_training_automatically_tagged.sentences.txt (2 MB)
- dataset_ner_fuzzy-regex_non-crossing_only-relevant_testing.ner_tags.txt (214 kB)
- dataset_ner_fuzzy-regex_non-crossing_all_training.sentences.txt (109 MB)
- dataset_ner_manatee+regests_all_only-relevant_training.sentences.txt (3 MB)
- dataset_ner_manatee+regests_non-crossing_all_validation_automatically_tagged.ner_tags.txt (3 MB)
- dataset_ner_manatee_all_all_testing.ner_tags.txt (4 MB)
- dataset_ner_manatee+regests_non-crossing_only-relevant_validation.sentences.txt (468 kB)
- dataset_ner_manatee_all_all_validation_automatically_tagged.ner_tags.txt (5 MB)
- dataset_ner_manatee+regests_non-crossing_only-relevant_validation_automatically_tagged.ner_tags.txt (198 kB)
- dataset_ner_manatee+regests_all_only-relevant_training.ner_tags.txt (1 MB)
- dataset_ner_fuzzy-regex+regests_non-crossing_all_validation.ner_tags.txt (8 MB)
- dataset_ner_manatee+regests_all_only-relevant_validation_automatically_tagged.ner_tags.txt (263 kB)
- dataset_ner_fuzzy-regex_non-crossing_all_training_automatically_tagged.ner_tags.txt (45 MB)
- dataset_ner_fuzzy-regex_non-crossing_only-relevant_testing.sentences.txt (632 kB)
- dataset_ner_regests_validation_automatically_tagged.sentences.txt (128 kB)
- dataset_ner_regests_validation_automatically_tagged.ner_tags.txt (50 kB)
- dataset_ner_manatee+regests_all_all_validation.ner_tags.txt (4 MB)
- dataset_ner_manatee+regests_non-crossing_only-relevant_training_automatically_tagged.sentences.txt (2 MB)
- dataset_ner_manatee_all_only-relevant_testing.sentences.txt (498 kB)
- dataset_ner_manatee_non-crossing_all_validation.sentences.txt (8 MB)
- dataset_ner_manatee_non-crossing_all_training.sentences.txt (42 MB)
- dataset_ner_manatee+regests_non-crossing_all_training_automatically_tagged.ner_tags.txt (18 MB)
- dataset_ner_manatee_all_only-relevant_validation_automatically_tagged.sentences.txt (500 kB)
- dataset_ner_manatee_non-crossing_only-relevant_testing_401-500_tagged.docx (39 kB)
- dataset_ner_fuzzy-regex_non-crossing_all_validation.ner_tags.txt (8 MB)
- dataset_ner_fuzzy-regex+regests_non-crossing_only-relevant_training.ner_tags.txt (1 MB)
- dataset_ner_fuzzy-regex_all_all_validation_automatically_tagged.sentences.txt (39 MB)
- dataset_ner_fuzzy-regex_all_only-relevant_validation.ner_tags.txt (299 kB)
- dataset_ner_manatee_all_all_training_automatically_tagged.sentences.txt (63 MB)
- dataset_ner_regests_training.ner_tags.txt (411 kB)
- dataset_ner_manatee+regests_non-crossing_only-relevant_validation_automatically_tagged.sentences.txt (468 kB)
- dataset_ner_manatee_non-crossing_all_testing.ner_tags.txt (2 MB)
- dataset_ner_manatee+regests_non-crossing_all_validation.sentences.txt (8 MB)
- dataset_ner_manatee+regests_all_all_training_automatically_tagged.ner_tags.txt (26 MB)
- dataset_ner_fuzzy-regex_all_all_training_automatically_tagged.sentences.txt (155 MB)
- dataset_ner_manatee_non-crossing_only-relevant_validation.ner_tags.txt (114 kB)
- dataset_ner_manatee+regests_all_only-relevant_validation.sentences.txt (628 kB)
- dataset_ner_fuzzy-regex_all_all_training_automatically_tagged.ner_tags.txt (63 MB)
- dataset_ner_manatee_non-crossing_all_training_automatically_tagged.ner_tags.txt (17 MB)
- dataset_ner_fuzzy-regex_non-crossing_only-relevant_training_automatically_tagged.ner_tags.txt (1 MB)
- dataset_ner_fuzzy-regex+regests_all_all_validation_automatically_tagged.ner_tags.txt (16 MB)
- dataset_ner_fuzzy-regex_all_all_training.sentences.txt (156 MB)
- dataset_ner_regests_training_automatically_tagged.ner_tags.txt (424 kB)
- dataset_ner_manatee_non-crossing_all_training_automatically_tagged.sentences.txt (42 MB)
- dataset_ner_manatee_all_all_validation.ner_tags.txt (4 MB)
- dataset_ner_fuzzy-regex+regests_non-crossing_all_training_automatically_tagged.sentences.txt (110 MB)
- dataset_ner_fuzzy-regex+regests_all_all_training_automatically_tagged.sentences.txt (156 MB)
- dataset_ner_fuzzy-regex_non-crossing_all_validation_automatically_tagged.sentences.txt (25 MB)
- dataset_ner_fuzzy-regex_non-crossing_all_validation.sentences.txt (25 MB)
- dataset_ner_fuzzy-regex_all_all_validation.ner_tags.txt (12 MB)
- dataset_ner_fuzzy-regex_all_only-relevant_training_automatically_tagged.ner_tags.txt (1 MB)
- dataset_ner_fuzzy-regex_non-crossing_only-relevant_training.sentences.txt (2 MB)
- dataset_ner_manatee_non-crossing_only-relevant_validation_automatically_tagged.sentences.txt (340 kB)
- dataset_ner_manatee_all_all_validation_automatically_tagged.sentences.txt (14 MB)
- dataset_ner_manatee_all_all_training.ner_tags.txt (20 MB)
- Name
- named-entity-recognition-annotations-large.zip
- Size
- 1.31 GB
- Format
- application/zip
- Description
- Sentences and NER tags for supervised training, validation, and testing of language models
- MD5
- 4bf38b89ed8948d3fa355b6e1e55d6de
- dataset_mlm_non-crossing_only-relevant_validation_automatically_tagged_004.sentences.txt (534 kB)
- dataset_mlm_all_only-relevant_training_automatically_tagged_004.sentences.txt (7 MB)
- dataset_mlm_non-crossing_all_training_automatically_tagged_004.sentences.txt (498 MB)
- dataset_mlm_all_all_validation_automatically_tagged_004.sentences.txt (77 MB)
- dataset_mlm_non-crossing_only-relevant_training_automatically_tagged_007.sentences.txt (6 MB)
- dataset_mlm_all_all_training_automatically_tagged_004.sentences.txt (599 MB)
- dataset_mlm_all_only-relevant_validation_automatically_tagged_004.ner_tags.txt (265 kB)
- dataset_mlm_all_only-relevant_validation_automatically_tagged_004.sentences.txt (717 kB)
- dataset_mlm_all_only-relevant_validation_automatically_tagged_007.ner_tags.txt (248 kB)
- dataset_mlm_all_all_validation_automatically_tagged_007.ner_tags.txt (28 MB)
- dataset_mlm_non-crossing_only-relevant_training_automatically_tagged_004.ner_tags.txt (2 MB)
- dataset_mlm_non-crossing_only-relevant_training_automatically_tagged_007.ner_tags.txt (2 MB)
- dataset_mlm_non-crossing_all_validation_automatically_tagged_007.sentences.txt (64 MB)
- dataset_mlm_non-crossing_all_training_automatically_tagged_004.ner_tags.txt (203 MB)
- dataset_mlm_non-crossing_all_training_automatically_tagged_007.ner_tags.txt (184 MB)
- dataset_mlm_all_only-relevant_training_automatically_tagged_004.ner_tags.txt (3 MB)
- dataset_mlm_non-crossing_only-relevant_training_automatically_tagged_004.sentences.txt (6 MB)
- dataset_mlm_all_only-relevant_training_automatically_tagged_007.ner_tags.txt (3 MB)
- dataset_mlm_all_all_training_automatically_tagged_004.ner_tags.txt (241 MB)
- dataset_mlm_all_all_training_automatically_tagged_007.ner_tags.txt (220 MB)
- dataset_mlm_non-crossing_only-relevant_validation_automatically_tagged_004.ner_tags.txt (202 kB)
- dataset_mlm_non-crossing_only-relevant_validation_automatically_tagged_007.ner_tags.txt (188 kB)
- dataset_mlm_non-crossing_all_validation_automatically_tagged_004.sentences.txt (64 MB)
- dataset_mlm_non-crossing_only-relevant_validation_automatically_tagged_007.sentences.txt (534 kB)
- dataset_mlm_all_only-relevant_training_automatically_tagged_007.sentences.txt (7 MB)
- dataset_mlm_non-crossing_all_training_automatically_tagged_007.sentences.txt (498 MB)
- dataset_mlm_all_all_validation_automatically_tagged_004.ner_tags.txt (30 MB)
- dataset_mlm_all_all_validation_automatically_tagged_007.sentences.txt (77 MB)
- dataset_mlm_all_all_training_automatically_tagged_007.sentences.txt (599 MB)
- dataset_mlm_all_only-relevant_validation_automatically_tagged_007.sentences.txt (717 kB)
- dataset_mlm_non-crossing_all_validation_automatically_tagged_004.ner_tags.txt (25 MB)
- dataset_mlm_non-crossing_all_validation_automatically_tagged_007.ner_tags.txt (23 MB)
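
The record lists an MD5 checksum for each of the three archives, so a download can be checked before unpacking. The following is a minimal sketch of such a check, not part of the dataset itself. It assumes the archives were saved under the names shown in this record in a single local directory; the helper names `md5_of_file` and `verify_archives` are hypothetical and chosen only for illustration.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the file listing in this record.
EXPECTED_MD5 = {
    "language-modeling-corpus.zip": "b6ed0a8e7dc263d1a1e635d6d5770f6b",
    "named-entity-recognition-annotations-small.zip": "92d80ec8d6e66263295797b1e81bd60d",
    "named-entity-recognition-annotations-large.zip": "4bf38b89ed8948d3fa355b6e1e55d6de",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that archives of several hundred MB do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archives(directory="."):
    """Compare each archive in `directory` against the listed checksum.

    Returns a dict mapping archive name to True (match), False (mismatch),
    or None (file not found in the assumed download directory).
    """
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if not path.exists():
            results[name] = None
        else:
            results[name] = md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "missing"}
    for name, status in verify_archives().items():
        print(f"{name}: {labels[status]}")
```

Run from the download directory, the script prints one status line per archive. A mismatch most likely indicates an incomplete or corrupted download, in which case the archive should be fetched again from the handle given in dc.identifier.uri.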