dc.contributor.author Křen, Michal
dc.contributor.author Cvrček, Václav
dc.contributor.author Henyš, Jan
dc.contributor.author Hnátková, Milena
dc.contributor.author Jelínek, Tomáš
dc.contributor.author Kocek, Jan
dc.contributor.author Kováříková, Dominika
dc.contributor.author Křivan, Jan
dc.contributor.author Milička, Jiří
dc.contributor.author Petkevič, Vladimír
dc.contributor.author Procházka, Pavel
dc.contributor.author Skoumalová, Hana
dc.contributor.author Šindlerová, Jana
dc.contributor.author Škrabal, Michal
dc.date.accessioned 2022-01-11T16:52:48Z
dc.date.available 2022-01-11T16:52:48Z
dc.date.issued 2021-12-05
dc.identifier.uri http://hdl.handle.net/11234/1-4635
dc.description Corpus of contemporary written (printed) Czech sized 4.7 GW (4.7 billion words, or 5.7 billion tokens). It covers mostly the 1990-2019 period and features rich metadata, including detailed bibliographical information and text-type classification. SYN v9 contains a wide variety of text types (fiction, non-fiction, newspapers), but newspapers clearly predominate. The corpus is lemmatized and morphologically tagged using the new CNC tagset, first used for the annotation of the SYN2020 corpus. SYN v9 is provided in a CoNLL-U-like vertical format used as input to the Manatee query engine. The data thus correspond to the corpus available via the KonText query interface to registered users of CNC at http://www.korpus.cz, with one important exception: the corpus is shuffled, i.e. divided into blocks of at most 100 words (respecting sentence boundaries), with the order of the blocks randomized within each document.
dc.language.iso ces
dc.publisher Charles University, Faculty of Arts, Institute of the Czech National Corpus
dc.relation.replaces http://hdl.handle.net/11234/1-1846
dc.rights Czech National Corpus (Shuffled Corpus Data)
dc.rights.uri https://lindat.mff.cuni.cz/repository/xmlui/page/license-cnc
dc.source.uri https://wiki.korpus.cz/doku.php/en:cnk:syn:verze9
dc.subject corpus
dc.subject written language
dc.title SYN v9: large corpus of written Czech
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
dc.rights.label ACA
has.files yes
branding LINDAT / CLARIAH-CZ
contact.person Michal Křen michal.kren@ff.cuni.cz Charles University, Faculty of Arts, Institute of the Czech National Corpus
sponsor Ministerstvo školství, mládeže a tělovýchovy LM2018137 Český národní korpus nationalFunds
size.info 4700000000 words
files.size 23486303548
files.count 1
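
The shuffling mentioned in dc.description (blocks of at most 100 words that respect sentence boundaries, with block order randomized within a document) can be illustrated with a minimal Python sketch. This is not the procedure actually used by CNC; the function names, the treatment of every token as a word, and the handling of sentences longer than the limit (each becomes its own block) are assumptions made for illustration only.

```python
import random
from typing import List, Optional

Sentence = List[str]          # a sentence as a list of word forms
Block = List[Sentence]        # a block as a list of whole sentences


def make_blocks(sentences: List[Sentence], max_words: int = 100) -> List[Block]:
    """Greedily group whole sentences into blocks of at most max_words words.

    Assumption: a single sentence longer than max_words is kept intact
    and forms a block of its own, since sentences are never split.
    """
    blocks: List[Block] = []
    current: Block = []
    count = 0
    for sent in sentences:
        # Close the current block if adding this sentence would exceed the limit.
        if current and count + len(sent) > max_words:
            blocks.append(current)
            current, count = [], 0
        current.append(sent)
        count += len(sent)
    if current:
        blocks.append(current)
    return blocks


def shuffle_document(sentences: List[Sentence],
                     max_words: int = 100,
                     seed: Optional[int] = None) -> List[Block]:
    """Return the document's blocks in randomized order (hypothetical helper)."""
    blocks = make_blocks(sentences, max_words)
    random.Random(seed).shuffle(blocks)  # order changes only within this document
    return blocks
```

Sentence order inside a block is preserved in this sketch; only the order of blocks within one document changes, which matches the description but leaves the original running text unrecoverable.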


 Files in this item

This item is Academic Use and licensed under: Czech National Corpus (Shuffled Corpus Data)
Attribution Required, Noncommercial
Name syn_v9.xz
Size 21.87 GB
Format application/x-xz
Description SYNv9 corpus data
MD5 82f4c62723618205b6134196f5eee93d
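
Since the download is a single xz archive of roughly 22 GB, it may be worth checking it against the MD5 listed above and peeking at the data without decompressing everything to disk. The following Python sketch uses only the standard library; the helper names are hypothetical, and UTF-8 encoding of the vertical file is an assumption not stated in this record.

```python
import hashlib
import lzma
from itertools import islice
from typing import List

# MD5 value as listed in this record for syn_v9.xz.
EXPECTED_MD5 = "82f4c62723618205b6134196f5eee93d"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_download(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file's MD5 matches the expected digest."""
    return md5_of_file(path) == expected.lower()


def head_lines(path: str, n: int = 5) -> List[str]:
    """Return the first n lines of an xz-compressed text file.

    The file is decompressed as a stream, so only the beginning is read.
    Assumption: the content is UTF-8 encoded text.
    """
    with lzma.open(path, "rt", encoding="utf-8") as fh:
        return [line.rstrip("\n") for line in islice(fh, n)]
```

For example, `check_download("syn_v9.xz")` should return True for an intact download, and `head_lines("syn_v9.xz", 20)` shows the first lines of the vertical format.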