ORATOR v3: corpus of spoken Czech monologues (transcriptions)

Please use the following text to cite this item:
Kopřivová, Marie; et al., 2025, ORATOR v3: corpus of spoken Czech monologues (transcriptions), LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11234/1-5932.
Date issued
2025-05-28
Size
1200000 words
Language(s)
Czech
Description
The ORATOR v3 corpus contains monologues by native Czech speakers. Typical situations include lectures, instructions, guided tours, welcome addresses and sermons. The corpus comprises 489 recordings made between 2005 and 2019 and contains 1 212 729 orthographic words (1 542 133 tokens in total, including punctuation); 468 different speakers appear in the recordings. The recordings were transcribed manually, and each transcription is linked to the corresponding audio track. ORATOR v3 is lemmatized and morphologically tagged according to the SYN2020 standard. The (anonymized) corpus is provided in the (semi-XML) vertical format used as input to the Manatee query engine. The data thus correspond exactly to the corpus available to registered users of the CNC via KonText at https://www.korpus.cz/kontext/query?corpname=orator_v3
Please note: this item includes only the transcriptions; the audio (and the transcripts in their original format) is available under a more restrictive non-CC license at http://hdl.handle.net/11234/1-5933
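As a rough illustration of how the vertical file might be read, the sketch below assumes the usual Manatee conventions: one token per line with tab-separated attributes, and structural markup on separate lines that start with "<" and end with ">". The number and order of attribute columns in ORATOR v3 are not stated here, so the code keeps all columns as found. The helper names (iter_vertical, count_kinds, count_file) are hypothetical and used only for this example.

```python
import gzip
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def iter_vertical(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield ("struct", [line]) for markup lines and ("token", columns) otherwise.

    Assumption: structural lines start with "<" and end with ">", and token
    lines are tab-separated. A token line whose last column is ">" would be
    misclassified by this simple heuristic.
    """
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # skip empty lines
        if line.startswith("<") and line.endswith(">"):
            yield ("struct", [line])
        else:
            yield ("token", line.split("\t"))


def count_kinds(lines: Iterable[str]) -> Counter:
    """Count how many token lines and structural lines the input contains."""
    return Counter(kind for kind, _ in iter_vertical(lines))


def count_file(path: str) -> Counter:
    """Stream a gzip-compressed vertical file without unpacking it to disk."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_kinds(handle)
```

Calling count_file("orator_v3_vert.gz") on the downloaded file would give a quick line-level tally; whether the "token" count matches the token figure quoted above depends on how the file actually encodes punctuation and markup.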
Files in this item
Name
orator_v3_vert.gz
Size
13.64 MB
Format
application/x-gzip
Description
gzip Archive
MD5
c019894684317ac6d48b53b39b7714e1
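The listed MD5 value can be used to check that a download completed intact. A minimal sketch using Python's standard hashlib module follows; the function names and the local file path are assumptions for illustration only.

```python
import hashlib

# MD5 value as listed for orator_v3_vert.gz in this item.
EXPECTED_MD5 = "c019894684317ac6d48b53b39b7714e1"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path: str) -> bool:
    """True if the file's digest equals the MD5 listed for this item."""
    return md5_of_file(path) == EXPECTED_MD5
```

For example, matches_listed_md5("orator_v3_vert.gz") should return True for an uncorrupted copy saved under that name in the current directory.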