SYN2025: representative corpus of written Czech
Please use the following text to cite this item:
Křen, Michal; et al., 2025,
SYN2025: representative corpus of written Czech, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11234/1-6110.
Authors
Křen, Michal; et al.
Item identifier
Project URL
Date issued
2025
Size
100,000,000 words
Language(s)
Czech
Description
SYN2025 is a representative corpus of contemporary written (printed) Czech comprising 100 million words. It represents printed language from 2020–2024 and covers a wide range of text types (fiction, professional literature, newspapers, etc.). The corpus is lemmatized and morphologically and syntactically annotated using a combination of methods. It is provided in the (semi-XML) vertical format used as input to the Manatee query engine. A vertical file is a sequence of lines, each of which is either a structure (starting with '<' and ending with '>') or a token (a fixed set of tab-separated columns). The columns of the SYN2025 token lines are described in more detail at https://wiki.korpus.cz/doku.php/en:seznamy:syn2025_attributes
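As an illustration of the structure/token distinction described above, the following minimal Python sketch classifies the lines of a vertical file. The function name `parse_vertical` and the sample lines (including the tag names and the two-column tokens) are hypothetical and chosen only for demonstration; the actual SYN2025 structures and columns are documented on the wiki page linked above.

```python
from typing import Iterable, Iterator, List, Tuple, Union

# Each parsed item is either ("structure", raw_line) or ("token", [columns]).
Item = Tuple[str, Union[str, List[str]]]


def parse_vertical(lines: Iterable[str]) -> Iterator[Item]:
    """Classify vertical-format lines as structures or tokens."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # skip empty lines
        if line.startswith("<") and line.endswith(">"):
            # Structure line, e.g. an opening or closing tag.
            yield ("structure", line)
        else:
            # Token line: tab-separated columns, kept as plain strings.
            yield ("token", line.split("\t"))


# Hypothetical sample; real SYN2025 lines have more columns and attributes.
sample = [
    '<doc id="example">',
    "<s>",
    "Pes\tpes",
    "štěká\tštěkat",
    "</s>",
    "</doc>",
]

for kind, value in parse_vertical(sample):
    print(kind, value)
```

The sketch applies the rule exactly as stated in the description; it does not validate that every token line has the same number of columns.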
The data provided here correspond exactly to those available to registered users of the Czech National Corpus (CNC) via the KonText query interface, with one important exception: they are shuffled, i.e. divided into blocks of at most 100 words (respecting sentence boundaries) whose order is randomized within each document.
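To make the idea of shuffling concrete, here is a minimal Python sketch of one way such blocks could be formed and reordered. It is not the procedure actually used to produce SYN2025; the function names, the greedy grouping strategy, and the handling of sentences longer than the limit (each is kept whole in its own block) are all assumptions made for illustration.

```python
import random
from typing import List, Optional

Sentence = List[str]        # a sentence is a list of word tokens
Block = List[Sentence]      # a block is a list of whole sentences


def make_blocks(sentences: List[Sentence], max_words: int = 100) -> List[Block]:
    """Greedily group whole sentences into blocks of at most max_words words.

    Assumption: a single sentence longer than max_words is not split and
    forms a block on its own.
    """
    blocks: List[Block] = []
    current: Block = []
    count = 0
    for sent in sentences:
        # Close the current block if adding this sentence would exceed the limit.
        if current and count + len(sent) > max_words:
            blocks.append(current)
            current, count = [], 0
        current.append(sent)
        count += len(sent)
    if current:
        blocks.append(current)
    return blocks


def shuffle_document(sentences: List[Sentence], max_words: int = 100,
                     seed: Optional[int] = None) -> List[Block]:
    """Return the document's blocks in randomized order."""
    blocks = make_blocks(sentences, max_words)
    random.Random(seed).shuffle(blocks)
    return blocks
```

Because blocks are reordered only within a document, sentence-level context is preserved while the original running text cannot be reconstructed directly.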
Acknowledgement
Ministerstvo školství, mládeže a tělovýchovy
Project code: LM2023044
Project name: Český národní korpus
Subject(s)
Collections
Files in this item
- Name: syn2025.xz
- Size: 1.21 GB
- Format: application/x-xz
- MD5: 3308a2543e6d7335e4c506c509a01dbe
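After downloading, the file can be checked against the MD5 listed above and read without decompressing it to disk. The following Python sketch uses only the standard library (`hashlib`, `lzma`); the helper names, the local path `syn2025.xz`, and the UTF-8 encoding are assumptions rather than facts stated on this page.

```python
import hashlib
import lzma
import os
from typing import Iterator

# MD5 checksum as listed for syn2025.xz on this page.
EXPECTED_MD5 = "3308a2543e6d7335e4c506c509a01dbe"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def iter_lines(path: str) -> Iterator[str]:
    """Stream lines from an .xz file (UTF-8 is assumed, not confirmed)."""
    with lzma.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield line.rstrip("\n")


if __name__ == "__main__":
    path = "syn2025.xz"  # assumed location of the downloaded file
    if os.path.exists(path):
        print("checksum OK:", md5_of_file(path) == EXPECTED_MD5)
        # Print the first few lines as a quick sanity check.
        for i, line in enumerate(iter_lines(path)):
            print(line)
            if i >= 4:
                break
```

Streaming matters here because the uncompressed vertical file is considerably larger than the 1.21 GB archive.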


