ORATOR v3: corpus of spoken Czech monologues (transcriptions)
Please use the following text to cite this item:
Kopřivová, Marie; et al., 2025,
ORATOR v3: corpus of spoken Czech monologues (transcriptions), LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11234/1-5932.
Authors
Kopřivová, Marie; et al.
Item identifier
http://hdl.handle.net/11234/1-5932
Project URL
Date issued
2025-05-28
Size
1200000 words
Language(s)
Czech
Description
The ORATOR v3 corpus contains monologues by native Czech speakers. Typical situations include lectures, instructions, guided tours, welcome addresses and sermons. The corpus is composed of 489 recordings from 2005–2019 and contains 1 212 729 orthographic words (i.e. a total of 1 542 133 tokens including punctuation); a total of 468 different speakers appear in the recordings. The recordings were transcribed manually, and the transcription is linked to the corresponding audio track. ORATOR v3 is lemmatized and morphologically tagged according to the SYN2020 standard. The (anonymized) corpus is provided in the (semi-XML) vertical format used as input to the Manatee query engine. The data thus correspond exactly to the corpus available to registered users of the CNC via KonText at https://www.korpus.cz/kontext/query?corpname=orator_v3
Please note: this item includes only the transcriptions. The audio (and the transcripts in their original format) are available under a more restrictive non-CC license at http://hdl.handle.net/11234/1-5933
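The vertical format mentioned in the description can be read line by line. The following is a minimal sketch, not the CNC's own tooling: it assumes the usual vertical conventions (one token per line with tab-separated attributes, structural markup on lines that start with "<" and end with ">"). The function names are hypothetical, and the attribute order (for example word, lemma, tag) is an assumption that should be checked against the actual file.

```python
import gzip
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def iter_vertical(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield ("struct", [markup_line]) or ("token", [attr1, attr2, ...]).

    Assumption: any line that starts with "<" and ends with ">" is
    structural markup; every other non-empty line is a token whose
    attributes are separated by tabs.
    """
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("<") and line.endswith(">"):
            yield "struct", [line]
        else:
            yield "token", line.split("\t")


def count_kinds(lines: Iterable[str]) -> Counter:
    """Count token lines and structural lines in an iterable of lines."""
    counts: Counter = Counter()
    for kind, _ in iter_vertical(lines):
        counts[kind] += 1
    return counts


def count_file(path: str) -> Counter:
    """Count token and structural lines in a gzip-compressed vertical file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_kinds(handle)
```

Calling `count_file("orator_v3_vert.gz")` on the downloaded archive should give a token count that can be compared with the token total stated in the description, provided the assumptions about the format hold.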
Acknowledgement
Ministerstvo školství, mládeže a tělovýchovy (Ministry of Education, Youth and Sports)
Project code: LM2023044
Project name: Český národní korpus (Czech National Corpus)
Subject(s)
Collections
This item is Publicly Available
and licensed under:
Files in this item
- Name: orator_v3_vert.gz
- Size: 13.64 MB
- Format: application/x-gzip
- Description: gzip Archive
- MD5: c019894684317ac6d48b53b39b7714e1
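The MD5 value listed for orator_v3_vert.gz can be used to check that a download is complete. A minimal sketch using Python's standard hashlib module follows; the function names and the local file path are hypothetical, and only the expected checksum is taken from the listing above.

```python
import hashlib

# Checksum as published in the file listing for orator_v3_vert.gz.
EXPECTED_MD5 = "c019894684317ac6d48b53b39b7714e1"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so the whole archive is never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Compare a file's MD5 digest with the expected value, ignoring case."""
    return md5_of_file(path).lower() == expected.lower()
```

For example, `matches_expected("orator_v3_vert.gz")` returns True only if the local copy has the published checksum. MD5 is suitable here for detecting a corrupted or truncated download, not as a security guarantee.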


