Etalon 1.0

Please use the following text to cite this item:
Skoumalová, Hana, 2021, Etalon 1.0, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11234/1-3698.
Date issued
2021-06-01
Size
153789 sentences
Language(s)
Czech
Description
Etalon is a manually annotated corpus of contemporary Czech. It contains 1,885,589 words (2,265,722 tokens) and is annotated in the same way as the SYN2020 corpus of the Czech National Corpus. The corpus comprises fiction (ca. 24%), professional and scientific literature (ca. 40%) and newspapers (ca. 36%). It is provided in a vertical format in which sentence boundaries are marked with a blank line. Every word form is written on a separate line, followed by five tab-separated attributes: syntactic word, lemma, sublemma, tag and verbtag. The texts are shuffled in random chunks of at most 100 words (respecting sentence boundaries).
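The vertical format described above could be read along the following lines. This is only a minimal sketch based on the description, not an official reader. It assumes that every non-blank line holds exactly six tab-separated columns (the word form plus the five attributes, in the order listed) and that the file contains no header or structural markup lines. The names `Token` and `read_sentences` are illustrative and not part of the corpus distribution.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """One line of the vertical file: the word form plus its five attributes."""
    form: str
    syntactic_word: str
    lemma: str
    sublemma: str
    tag: str
    verbtag: str


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences as lists of Token; a blank line ends a sentence.

    Assumption: each non-blank line has exactly six tab-separated fields.
    """
    sentence: List[Token] = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            # Blank line: close the current sentence, if one is open.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) != 6:
            raise ValueError(
                f"line {number}: expected 6 columns, got {len(fields)}"
            )
        sentence.append(Token(*fields))
    # The last sentence may not be followed by a blank line.
    if sentence:
        yield sentence
```

To use it, open the extracted corpus file as UTF-8 text (the encoding is an assumption) and pass the file object to `read_sentences`. Yielding sentences one at a time avoids holding the whole corpus in memory.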
 Files in this item
Name
Etalon.tgz
Size
17.22 MB
Format
application/x-gzip
Description
Etalon - annotated corpus
MD5
7dd13171d135d33b1af5065ac9aa26e8
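The MD5 value above can be used to check a downloaded copy of the archive. A minimal sketch follows, assuming the file was saved locally as `Etalon.tgz`. The helper name `md5_of_file` is illustrative.

```python
import hashlib

# Checksum as listed in this record.
EXPECTED_MD5 = "7dd13171d135d33b1af5065ac9aa26e8"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing `md5_of_file("Etalon.tgz")` with `EXPECTED_MD5` then shows whether the download is intact. The local path is an assumption about where the file was saved.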