SYN v14: large corpus of written Czech
Please use the following text to cite this item:
Křen, Michal; et al., 2025, SYN v14: large corpus of written Czech, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11234/1-6111.
Authors
Křen, Michal; et al.
Item identifier
Date issued
2025
Size
5 490 000 000 words
Language(s)
Description
Corpus of contemporary written (printed) Czech containing almost 5.5 billion words (i.e. 6.6 billion tokens). It mostly covers the period 1990-2024 and features rich metadata, including detailed bibliographical information and text-type classification. SYN v14 contains a wide variety of text types (fiction, non-fiction, newspapers), with newspapers clearly predominating. The corpus is lemmatized and morphologically tagged with the unified CNC tagset, and it also includes annotation of multiword expressions. The data provided here correspond exactly to those available to registered CNC users via the KonText query interface, with one important exception: they are shuffled, i.e. each document is divided into blocks of at most 100 words (respecting sentence boundaries), and the order of the blocks is randomized within that document.
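The shuffling described above can be illustrated with a minimal sketch. This is not the CNC's actual procedure, only one plausible reading of the description: sentences are packed greedily into blocks of at most 100 words, and the blocks are then reordered at random within the document. The function name `shuffle_blocks`, its parameters, and the handling of a single sentence longer than the limit (it forms its own block) are all assumptions made for illustration.

```python
import random
from typing import List, Optional

Sentence = List[str]  # a sentence as a list of words


def shuffle_blocks(
    sentences: List[Sentence],
    max_words: int = 100,
    seed: Optional[int] = None,
) -> List[List[Sentence]]:
    """Pack sentences into blocks of at most `max_words` words, then
    randomize the order of the blocks (hypothetical illustration)."""
    blocks: List[List[Sentence]] = []
    current: List[Sentence] = []
    size = 0
    for sentence in sentences:
        # Close the current block if this sentence would overflow it.
        # Sentences are never split, so boundaries are respected.
        if current and size + len(sentence) > max_words:
            blocks.append(current)
            current, size = [], 0
        current.append(sentence)
        size += len(sentence)
    if current:
        blocks.append(current)
    # Assumption: a sentence longer than max_words ends up alone in a block.
    random.Random(seed).shuffle(blocks)
    return blocks
```

A practical consequence for users of the data is that the original running order of a document cannot be recovered from the blocks, while sentence-internal context is left intact.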
SYN v14 is provided in a semi-XML / CoNLL-U-like vertical format that serves as input to the Manatee query engine. The vertical format is a sequence of lines, each of which is either a structure (starting with '<' and ending with '>') or a token (a fixed set of tab-separated columns). The columns of the SYN v14 token lines are, in order: word / sword [syntactic word] / lemma / sublemma / tag / pos / case / verbtag [verbal tag] / mwe_lemma [multiword lemma] / mwe_tag [multiword tag]
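As a rough sketch of how such a vertical file could be read, the following Python code separates structure lines from token lines and maps the ten columns listed above to named fields. The names `Token`, `Structure`, `parse_vertical` and `iter_corpus` are hypothetical, the file is assumed to be UTF-8 encoded, and the sample lines (including the attribute name `id` and all column values) are invented placeholders rather than real corpus data.

```python
import lzma
from collections import namedtuple
from typing import Iterable, Iterator, Union

# Column order as given in the item description.
COLUMNS = ("word", "sword", "lemma", "sublemma", "tag",
           "pos", "case", "verbtag", "mwe_lemma", "mwe_tag")

Token = namedtuple("Token", COLUMNS)
Structure = namedtuple("Structure", ["raw"])


def parse_vertical(lines: Iterable[str]) -> Iterator[Union[Token, Structure]]:
    """Yield a Structure for each '<...>' line and a Token for each token line."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # assumption: blank lines carry no information
        if line.startswith("<") and line.endswith(">"):
            yield Structure(line)
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError(
                f"expected {len(COLUMNS)} columns, got {len(fields)}: {line!r}")
        yield Token(*fields)


def iter_corpus(path: str) -> Iterator[Union[Token, Structure]]:
    """Stream items from the xz-compressed file without unpacking it to disk."""
    # Assumption: the vertical file is UTF-8 encoded.
    with lzma.open(path, "rt", encoding="utf-8") as handle:
        yield from parse_vertical(handle)


# Invented sample input; the values are placeholders, not real annotation.
SAMPLE = [
    '<doc id="hypothetical">\n',
    "<s>\n",
    "\t".join(["Psi", "Psi", "pes", "pes", "TAG", "N", "1", "-", "-", "-"]) + "\n",
    "</s>\n",
    "</doc>\n",
]
items = list(parse_vertical(SAMPLE))
```

Streaming through `lzma.open` avoids decompressing the whole archive first, which matters for a file of this size. For example, `iter_corpus("syn_v14.xz")` could be consumed lazily in a `for` loop.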
Acknowledgement
Ministry of Education, Youth and Sports of the Czech Republic (Ministerstvo školství, mládeže a tělovýchovy)
Project code: LM2023044
Project name: Český národní korpus (Czech National Corpus)
Subject(s)
Files in this item
- Name: syn_v14.xz
- Size: 25.44 GB
- Format: application/x-xz
- MD5: c56b6671645137d9cd542884e03c372e
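After downloading, the file can be checked against the MD5 sum listed above. The following is a minimal sketch using Python's standard `hashlib`. The helper name `md5_of_file`, the chunk size, and the local path `syn_v14.xz` are assumptions for illustration.

```python
import hashlib

# MD5 sum as published in the file listing of this item.
EXPECTED_MD5 = "c56b6671645137d9cd542884e03c372e"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks
    so that a multi-gigabyte archive never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_download_intact(path: str) -> bool:
    """Compare the computed digest with the published one."""
    return md5_of_file(path) == EXPECTED_MD5
```

Calling `is_download_intact("syn_v14.xz")` on the downloaded archive should return `True` if the transfer completed without corruption.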


