Harvested from: LINDAT/CLARIAH-CZ repository / Language: Czech and German

1. A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents

Creator:: Novotný, Vít, Luger, Kristýna, Štefánik, Michal, Vrabcová, Tereza, and Horák, Aleš
Publisher:: Masaryk University, Brno
Type:: text and corpus
Subject:: NER, named entity recognition, and Medieval
Language:: Czech, English, German, and Latin
Description:: This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).
Rights:: Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB

2. A Human-Annotated Dataset for Language Modeling and Named Entity Recognition in Medieval Documents (2023-01-05)

Creator:: Novotný, Vít, Luger, Kristýna, Štefánik, Michal, Vrabcová, Tereza, and Horák, Aleš
Publisher:: Masaryk University, Brno
Type:: text and corpus
Subject:: NER, named entity recognition, and Medieval
Language:: Czech, English, German, and Latin
Description:: This is an open dataset of sentences from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains a corpus for language modeling and human annotations for named entity recognition (NER).
Rights:: Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB

3. A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Documents

Creator:: Novotný, Vít, Seidlová, Kristýna, Vrabcová, Tereza, and Horák, Aleš
Publisher:: Masaryk University, Brno
Type:: image and corpus
Subject:: ocr, optical character recognition, language identification, image super-resolution, sr, and Medieval
Language:: German, Czech, Latin, and English
Description:: This is an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification.
Rights:: Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB

4. A Human-Annotated Dataset of Scanned Images and OCR Texts from Medieval Documents: Supplementary Materials

Creator:: Novotný, Vít and Horák, Aleš
Publisher:: Masaryk University, Brno
Type:: text and corpus
Subject:: ocr, optical character recognition, language identification, image super-resolution, sr, and Medieval
Language:: Czech, English, German, and Latin
Description:: These are supplementary materials for an open dataset of scanned images and OCR texts from 19th and 20th century letterpress reprints of documents from the Hussite era. The dataset contains human annotations for layout analysis, OCR evaluation, and language identification and is available at http://hdl.handle.net/11234/1-4615. These supplementary materials contain OCR texts from different OCR engines for book pages for which we have both high-resolution scanned images and annotations for OCR evaluation.
Rights:: Public Domain Dedication (CC Zero), http://creativecommons.org/publicdomain/zero/1.0/, and PUB

5. Additional German-Czech reference translations of the WMT'11 test set

Creator:: Bojar, Ondřej, Zeman, Daniel, Dušek, Ondřej, Břečková, Jana, Farkačová, Hana, Grošpic, Pavel, Kačenová, Kristýna, Knechtová, Eva, Koubová, Anna, Lukavská, Jana, Nováková, Petra, and Petrdlíková, Jana
Publisher:: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Type:: text and corpus
Subject:: reference translation, German-Czech, and parallel corpus
Language:: German and Czech
Description:: Three additional Czech reference translations of the whole WMT 2011 data set (http://www.statmt.org/wmt11/test.tgz), translated from the German originals. The original segmentation of the WMT 2011 data is preserved. This project has been sponsored by the grants GAČR P406/11/1499 and EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003+7E11051 of the Ministry of Education, Youth and Sports of the Czech Republic).
Rights:: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0), http://creativecommons.org/licenses/by-nc-sa/3.0/, and PUB

6. Annotated corpora and tools of the PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions (edition 1.0)

Creator:: Savary, Agata, Ramisch, Carlos, Cordeiro, Silvio Ricardo, Sangati, Federico, Vincze, Veronika, QasemiZadeh, Behrang, Candito, Marie, Cap, Fabienne, Giouli, Voula, Stoyanova, Ivelina, Doucet, Antoine, Adalı, Kübra, Barbu Mititelu, Verginica, Bejček, Eduard, El Maarouf, Ismail, Eryiğit, Gülşen, Galea, Luke, Ha-Cohen Kerner, Yaakov, Liebeskind, Chaya, Monti, Johanna, Parra Escartín, Carla, Kovalevskaitė, Jolanta, Krek, Simon, van der Plas, Lonneke, Aceta, Cristina, Aduriz, Itziar, Antoine, Jean-Yves, Attard, Greta, Azzopardi, Kirsty, Boizou, Loic, Bonnici, Janice, Boz, Mert, Bumbulienė, Ieva, Busuttil, Jael, Caruso, Valeria, Cherchi, Manuela, Constant, Matthieu, Czerepowicka, Monika, De Santis, Anna, Dimitrova, Tsvetana, Dinç, Tutkum, Elyovich, Hevi, Fabri, Ray, Farrugia, Alison, Findlay, Jamie, Fotopoulou, Aggeliki, Foufi, Vassiliki, Galea, Sara Anne, Gantar, Polona, Gatt, Albert, Gatt, Anabelle, Herrero, Carlos, Iñurrieta, Uxoa, Jagfeld, Glorianna, Hnátková, Milena, Ionescu, Mihaela, Klyueva, Natalia, Koeva, Svetla, Kovács, Viktória, Kuzman, Taja, Leseva, Svetlozara, Louisou, Sevi, Lynn, Teresa, Malka, Ruth, Martínez Alonso, Héctor, McCrae, John, de Medeiros Caseli, Helena, Miral, Ayşenur, Muscat, Amanda, Nivre, Joakim, Oakes, Michael, Onofrei, Mihaela, Parmentier, Yannick, Pasquer, Caroline, Pia di Buono, Maria, Priego Sanchez, Belem, Raffone, Annalisa, Ramisch, Renata, Rimkutė, Erika, Rizea, Monica-Mihaela, Simkó, Katalin, Spagnol, Michael, Stefanova, Valentina, Stymne, Sara, Sulubacak, Umut, Tabone, Nicole, Tanti, Marc, Todorova, Maria, Urešová, Zdenka, Villavicencio, Aline, and Zilio, Leonardo
Publisher:: PARSEME
Type:: text and corpus
Subject:: Multiword expressions, verbal multiword expressions, idioms, light-verb constructions, verb-particle constructions, and inherently reflexive verbs
Language:: Bulgarian, Czech, German, Modern Greek (1453-), Spanish, Persian, French, Hebrew, Hungarian, Italian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovenian, Swedish, and Turkish
Description:: The PARSEME shared task aims at identifying verbal MWEs (VMWEs) in running texts. Verbal MWEs include idioms (let the cat out of the bag), light verb constructions (make a decision), verb-particle constructions (give up), and inherently reflexive verbs (se suicider 'to suicide' in French). VMWEs were annotated according to the universal guidelines in 18 languages. The corpora are provided in the parsemetsv format, inspired by the CoNLL-U format. For most languages, paired files in the CoNLL-U format (not necessarily using UD tagsets) containing parts of speech, lemmas, morphological features and/or syntactic dependencies are also provided. Depending on the language, the information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe). This item contains training and test data, tools and the universal guidelines file.
Rights:: PARSEME Shared Task Data (v. 1.0) Agreement, https://lindat.mff.cuni.cz/repository/xmlui/page/licence-mwe-1.0, and PUB

7. Botanicus Digital Library

Type:: corpus
Subject:: German studies (Germanistik)
Language:: Chinese, Czech, English, French, German, Latin, and Spanish
Description:: Digital copies (page images) of historical botanical writings from the Missouri Botanical Garden Library. German-language texts make up only part of the collection.
Rights:: Not specified

8. C4Corpus (CC BY-NC part)

9. C4Corpus (CC BY-NC-ND part)

10. C4Corpus (CC BY-NC-SA part)
