
ORATOR v3: corpus of spoken Czech monologues (transcriptions & audio)

Please use the following text to cite this item:
Kopřivová, Marie; et al., 2025, ORATOR v3: corpus of spoken Czech monologues (transcriptions & audio), LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11234/1-5933.
Date issued
2025-05-28
Size
1200000 words
Language(s)
Czech
Description
The ORATOR v3 corpus contains monologues by native Czech speakers. Typical situations include lectures, instructions, guided tours, welcome addresses and sermons. The corpus is composed of 489 recordings from 2005–2019 and contains 1 212 729 orthographic words (i.e. a total of 1 542 133 tokens including punctuation); a total of 468 different speakers appear in the recordings. The recordings were transcribed manually, and each transcription is linked to the corresponding audio track. ORATOR v3 is lemmatized and morphologically tagged according to the SYN2020 standard. The (anonymized) transcriptions are provided in the XML ELAN Annotation format; the audio (with corresponding anonymization beeps) is in uncompressed 16-bit PCM WAV, mono, 16 kHz format. The transcriptions are also available in another format under the less restrictive CC BY-NC-SA license at http://hdl.handle.net/11234/1-5932
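The audio parameters stated above (16-bit PCM, mono, 16 kHz) can be checked on an extracted file before further processing. The following is a minimal sketch using Python's standard `wave` module; the function names and the file name `recording.wav` are hypothetical, since the layout of the archive is not described here.

```python
import wave

# Expected parameters, taken from the item description:
# mono, 16-bit (2 bytes per sample), 16 kHz.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "frame_rate_hz": 16000}


def wav_parameters(path):
    """Read channel count, sample width and frame rate from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "frame_rate_hz": wav.getframerate(),
        }


def matches_description(path):
    """Return True if the file has the parameters stated in the description."""
    return wav_parameters(path) == EXPECTED
```

For example, `matches_description("recording.wav")` would return `True` for a file in the stated format; a mismatch suggests the file was converted or is not part of the original distribution.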
Acknowledgement
This item is Academic Use
and licensed under:
 Files in this item
Name
orator_v3.tar.gz
Size
36.27 GB
Format
application/x-gzip
Description
gzip Archive
MD5
e6817f4e92143f8b053d6699d291a1e4
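Because the archive is large (36.27 GB), it may be worth verifying the download against the MD5 value listed above. The following is a minimal sketch using Python's standard `hashlib` module; the function names are hypothetical, and the file is read in chunks so it does not have to fit in memory.

```python
import hashlib

# MD5 checksum listed for orator_v3.tar.gz in this item.
EXPECTED_MD5 = "e6817f4e92143f8b053d6699d291a1e4"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

Calling `archive_is_intact("orator_v3.tar.gz")` on the downloaded file should return `True` if the transfer completed without corruption. MD5 is suitable here as an integrity check only, not as protection against deliberate tampering.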