Kopřivová, Marie; Komrsková, Zuzana; Lukeš, David; Poukarová, Petra; Škarpová, Marie
2018-01-02T12:21:53Z 2018-01-02T12:21:53Z 2017-12-28
dc.description ORTOFON v1 is designed as a representation of authentic spoken Czech as used in informal situations (private environment, spontaneity, unpreparedness, etc.) throughout the Czech Republic. The corpus is composed of 332 recordings from 2012–2017 and contains 1 014 786 orthographic words (i.e. a total of 1 236 508 tokens including punctuation); a total of 624 different speakers appear in the probes. ORTOFON v1 is fully balanced with regard to the basic sociolinguistic speaker categories (gender, age group, level of education and region of childhood residence). The transcription is linked to the corresponding audio track. Unlike in the ORAL-series corpora, the transcription was carried out on two main tiers, orthographic and phonetic, supplemented by an additional metalanguage tier. ORTOFON v1 is lemmatized and morphologically tagged. The (anonymized) corpus is provided in the (semi-XML) vertical format used as input to the Manatee query engine. The data thus correspond to the corpus available via the KonText query engine to registered users of the CNC. Please note: this item includes only the transcriptions; the audio (along with the transcripts in their original format) is available under a more restrictive non-CC license.
dc.language.iso ces
dc.publisher Charles University, Faculty of Arts, Institute of the Czech National Corpus
dc.rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.subject balanced corpus
dc.subject spoken language
dc.subject informal language
dc.subject Czech
dc.title ORTOFON v1: balanced corpus of informal spoken Czech with multi-tier transcription (transcriptions)
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
dc.rights.label PUB
has.files yes
branding LINDAT / CLARIN
contact.person David Lukeš Charles University, Faculty of Arts, Institute of the Czech National Corpus
sponsor Ministerstvo školství, mládeže a tělovýchovy LM2015044 Český národní korpus nationalFunds 1000000 words
files.size 13525766
files.count 1

 Files in this item

12.9 MB
the data