ORATOR v3: corpus of spoken Czech monologues (transcriptions & audio)
Please use the following text to cite this item:
Kopřivová, Marie; et al., 2025,
ORATOR v3: corpus of spoken Czech monologues (transcriptions & audio), LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11234/1-5933.
Authors
Kopřivová, Marie ; et al.
Item identifier
http://hdl.handle.net/11234/1-5933
Project URL
Date issued
2025-05-28
Size
1200000 words
Language(s)
Czech
Description
The ORATOR v3 corpus contains monologues by native speakers of Czech. Typical situations include lectures, instructions, guided tours, welcome addresses and sermons, among others. The corpus comprises 489 recordings made between 2005 and 2019 and contains 1 212 729 orthographic words (1 542 133 tokens including punctuation); a total of 468 different speakers appear in the recordings. The transcriptions were made manually and are linked to the corresponding audio track. ORATOR v3 is lemmatized and morphologically tagged according to the SYN2020 standard. The anonymized transcriptions are provided in the ELAN XML annotation format; the audio, with beeps marking the anonymized passages, is in uncompressed 16-bit PCM WAV, mono, 16 kHz format. The transcriptions are also available in another format under the less restrictive CC BY-NC-SA license at http://hdl.handle.net/11234/1-5932
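Since the description states a specific audio format (uncompressed 16-bit PCM WAV, mono, 16 kHz), a quick sanity check on an extracted recording can catch files that were resampled or converted along the way. The following is a minimal sketch using Python's standard `wave` module; the function name `matches_orator_audio_spec` is hypothetical, and the layout of the archive (file names, directories) is not described here, so the path is left to the caller.

```python
import wave


def matches_orator_audio_spec(path):
    """Return True if the WAV file is mono, 16-bit, 16 kHz, uncompressed PCM.

    The expected values are taken from the corpus description; the function
    name itself is an illustrative assumption, not part of the corpus tooling.
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == 1          # mono
            and wav.getsampwidth() == 2      # 16 bits = 2 bytes per sample
            and wav.getframerate() == 16000  # 16 kHz sampling rate
            and wav.getcomptype() == "NONE"  # uncompressed PCM
        )
```

Calling `matches_orator_audio_spec("some_recording.wav")` on a file extracted from the archive should return `True` if the file matches the description; the file name here is only a placeholder.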
Acknowledgement
Ministry of Education, Youth and Sports of the Czech Republic (Ministerstvo školství, mládeže a tělovýchovy)
Project code: LM2023044
Project name: Czech National Corpus (Český národní korpus)
Subject(s)
Collections
Files in this item
- Name: orator_v3.tar.gz
- Size: 36.27 GB
- Format: application/x-gzip
- Description: gzip Archive
- MD5: e6817f4e92143f8b053d6699d291a1e4
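
Because the archive is large (36.27 GB), it may be worth verifying a completed download against the MD5 checksum listed above before unpacking it. The sketch below reads the file in chunks so that it does not need to fit in memory; the helper names `md5_of_file` and `archive_is_intact` are hypothetical, and the local path of the downloaded file is an assumption left to the caller.

```python
import hashlib

# MD5 checksum copied from the file listing for orator_v3.tar.gz.
EXPECTED_MD5 = "e6817f4e92143f8b053d6699d291a1e4"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns empty bytes (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, `archive_is_intact("orator_v3.tar.gz")` would return `True` only if the local copy matches the published checksum. MD5 is used here solely because it is the checksum the listing provides; it detects transfer corruption but is not a guarantee against deliberate tampering.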


