Catalogue authors: Jana Koudelová, David Stejskal, Jan Tippner, Michal Kloiber, Jiří Bláha, Petr Růžička, Tomáš Kolář, Michal Rybníček, Jaroslav Buzek, Tomáš Dostál, Jaromír Milch, Jan Baar, Dominik Hess, Jan Zlámal, Hanuš Vavrčík, Radek Bryol; editor: Jana Koudelová; photographs: Jaroslav Hrivnák, Michal Kloiber, Jan Kolář, Tomáš Kolář, David Stejskal, Willy Tegel. Includes a list of abbreviations, a bibliography, and an English summary.
A model trained for Czech POS tagging and lemmatization using RobeCzech, a Czech version of the BERT model. The model is trained on data from the Prague Dependency Treebank 3.5. It is part of the master's thesis Czech NLP with Contextualized Embeddings and achieved state-of-the-art performance as of the date the thesis was submitted.
A demo Jupyter notebook is available on the project's GitHub.
The Prague Dependency Treebank – Consolidated 1.0 (PDT-C 1.0, hereafter PDT-C) is a richly annotated, genre-diversified language resource: a consolidated release of the existing PDT corpora of Czech data, uniformly annotated using the standard PDT scheme. The PDT corpora included in PDT-C are: the Prague Dependency Treebank (the original PDT contents, written newspaper and journal texts from three genres); the Czech part of the Prague Czech-English Dependency Treebank (financial texts translated from English); the Prague Dependency Treebank of Spoken Czech (spoken data, including audio, transcripts, and multiple speech reconstruction annotation); and PDT-Faust (user-generated texts). PDT-C differs from the separately published original treebanks as follows: it is published as one package, which makes data handling easier across all the datasets; the data is enhanced with manual linguistic annotation at the morphological layer, and a new version of the morphological dictionary is enclosed; and a common valency lexicon for all four original parts is enclosed. The documentation provides two desktop tools for browsing and editing (TrEd and MEd), and the corpus is also available online for searching using PML-TQ.