
SYN2025: representative corpus of written Czech

Please use the following text to cite this item:
Křen, Michal; et al., 2025, SYN2025: representative corpus of written Czech, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11234/1-6110.
Date issued
2025
Size
100,000,000 words
Language(s)
Czech
Description
Representative corpus of contemporary written (printed) Czech comprising 100 million words. It was created as a representation of the printed language of 2020–2024 and contains a wide range of text types (fiction, professional literature, newspapers, etc.). The corpus is lemmatized and morphologically and syntactically annotated by a combination of various methods. It is provided in the (semi-XML) vertical format used as input to the Manatee query engine. The vertical format is a sequence of lines, each of which is either a structure (it starts with '<' and ends with '>') or a token (with a fixed set of tab-separated columns). The columns of the SYN2025 token lines are described in more detail at https://wiki.korpus.cz/doku.php/en:seznamy:syn2025_attributes (link). The data provided here correspond exactly to those available via the KonText query interface to registered users of the Czech National Corpus (CNC), with one important exception: they are shuffled, i.e. each document is divided into blocks of at most 100 words (respecting sentence boundaries), and the order of the blocks is randomized within that document.
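The line-by-line layout described above can be read with a few lines of Python. The following is only a minimal sketch, not an official reader: the function names `parse_vertical` and `read_vertical_xz` are hypothetical, UTF-8 encoding is an assumption, and the sample lines (including their three columns) are made up for illustration. The real column set is the one documented on the wiki page linked above.

```python
import lzma


def parse_vertical(lines):
    """Yield ("structure", text) or ("token", [columns]) for each non-empty line.

    A line that starts with '<' and ends with '>' is treated as a structure;
    every other line is treated as a token with tab-separated columns.
    """
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("<") and line.endswith(">"):
            yield ("structure", line)
        else:
            yield ("token", line.split("\t"))


def read_vertical_xz(path, encoding="utf-8"):
    """Stream a .xz-compressed vertical file without unpacking it to disk.

    The UTF-8 default is an assumption, not something stated in the item record.
    """
    with lzma.open(path, "rt", encoding=encoding) as handle:
        yield from parse_vertical(handle)


# Made-up sample in the general shape of a vertical file (not real SYN2025 data).
SAMPLE = [
    '<doc id="example">\n',
    "<s>\n",
    "Ahoj\tahoj\tX\n",
    "</s>\n",
    "</doc>\n",
]

parsed = list(parse_vertical(SAMPLE))
token_count = sum(1 for kind, _ in parsed if kind == "token")
```

To process the downloaded file, iterate over `read_vertical_xz("syn2025.xz")` in the same way. Because the reader is a generator, the whole corpus never has to fit in memory.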
This item is Academic Use
and licensed under:
Files in this item
Name
syn2025.xz
Size
1.21 GB
Format
application/x-xz
Description
MD5
3308a2543e6d7335e4c506c509a01dbe
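The MD5 value above can be used to check that the download is complete. A minimal sketch using Python's standard `hashlib` follows. The helper name `md5_of_file` and the local file name are assumptions, and the file is read in chunks because it is larger than 1 GB.

```python
import hashlib

# Checksum copied from the file listing above.
EXPECTED_MD5 = "3308a2543e6d7335e4c506c509a01dbe"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected=EXPECTED_MD5):
    """Compare the computed digest with the published one (case-insensitive)."""
    return md5_of_file(path) == expected.lower()
```

Calling `is_intact("syn2025.xz")` on the downloaded file (the path is an assumption about where it was saved) should return `True` if the transfer was not corrupted.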