This is not the latest version of this item. The latest version can be found here.
SYN v9: large corpus of written Czech
Please use the following text to cite this item:
Křen, Michal; et al., 2021,
SYN v9: large corpus of written Czech, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11234/1-4635.
Authors
Křen, Michal; et al.
Item identifier
http://hdl.handle.net/11234/1-4635
Date issued
2021-12-05
Size
4700000000 words
Language(s)
Czech
Description
Corpus of contemporary written (printed) Czech containing 4.7 billion words (5.7 billion tokens). It covers mostly the 1990-2019 period and features rich metadata, including detailed bibliographical information and text-type classification. SYN v9 contains a wide variety of text types (fiction, non-fiction, newspapers), but newspapers clearly predominate. The corpus is lemmatized and morphologically tagged with the new CNC tagset, first used to annotate the SYN2020 corpus.
SYN v9 is provided in a CoNLL-U-like vertical format that serves as input to the Manatee query engine. The data thus correspond to the corpus available to registered CNC users via the KonText query interface at http://www.korpus.cz, with one important exception: the corpus is shuffled, i.e. each document is divided into blocks of at most 100 words (respecting sentence boundaries) and the order of the blocks is randomized within that document.
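Because the archive is large, it is usually more practical to stream it than to decompress it to disk. The following Python sketch shows one way to do that. It rests on assumptions that are not stated in this record: that the file is UTF-8 text, that structural markup (for example document or sentence boundaries) sits on its own lines in angle brackets, and that every other non-empty line is one corpus position with tab-separated attributes. The function names and the order and meaning of the columns are hypothetical and should be checked against the corpus documentation.

```python
import lzma
import re
import sys

# Assumption: a structural line is a single tag such as <doc ...>, <s> or </s>.
# A token line that merely starts with "<" (e.g. the "<" character itself)
# does not match, because the tag name must start with a letter.
STRUCT_RE = re.compile(r"^</?[A-Za-z_][^>\t]*>$")


def iter_vertical(lines):
    """Yield ("struct", text) for markup lines and ("token", columns) otherwise."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # skip empty lines
        if STRUCT_RE.match(line):
            yield ("struct", line)
        else:
            # Assumption: attributes of one position are separated by tabs.
            yield ("token", line.split("\t"))


def count_positions(path, limit=None):
    """Stream an .xz vertical file and count token lines (optionally only the first `limit` lines)."""
    count = 0
    # lzma.open in text mode decompresses on the fly; encoding is assumed to be UTF-8.
    with lzma.open(path, "rt", encoding="utf-8") as stream:
        for i, (kind, _) in enumerate(iter_vertical(stream)):
            if limit is not None and i >= limit:
                break
            if kind == "token":
                count += 1
    return count


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python read_syn.py syn_v9.xz
    print(count_positions(sys.argv[1], limit=1_000_000))
```

Passing a `limit` keeps a first inspection cheap. Note that, because the corpus is shuffled, consecutive blocks within a document do not form continuous text.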
Acknowledgement
Ministry of Education, Youth and Sports (Ministerstvo školství, mládeže a tělovýchovy)
Project code: LM2018137
Project name: Czech National Corpus (Český národní korpus)
Files in this item
- Name: syn_v9.xz
- Size: 21.87 GB
- Format: application/x-xz
- Description: SYNv9 corpus data
- MD5: 82f4c62723618205b6134196f5eee93d
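
After downloading a file of this size, it is worth checking it against the published MD5 checksum before use. The sketch below reads the file in chunks so that it does not have to fit in memory. The function name `md5_of_file` and the local path `syn_v9.xz` are illustrative assumptions; the expected digest is the one listed above.

```python
import hashlib
import os

# MD5 digest as published in the file listing of this item.
EXPECTED_MD5 = "82f4c62723618205b6134196f5eee93d"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    path = "syn_v9.xz"  # assumed local download location
    if os.path.exists(path):
        ok = md5_of_file(path) == EXPECTED_MD5
        print("checksum OK" if ok else "checksum MISMATCH")
```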


