Prague DaTabase of Spoken Czech

PDTSC 1.0 is a multi-purpose corpus of spoken language. It contains 768,888 tokens, 73,374 sentences and 7,324 minutes of spontaneous dialog speech, which have been recorded, transcribed and edited in several interlinked layers: audio recordings, automatic and manual transcription, and manually reconstructed text.

PDTSC 1.0 is a delayed release of data annotated in 2012. It is an update of the Prague Dependency Treebank of Spoken Language (PDTSL) 0.5, published in 2009. In 2017, the Prague Dependency Treebank of Spoken Czech (PDTSC) 2.0 was published as an update of PDTSC 1.0.

The corpus consists of two types of dialogs. The first is the Czech portion of the Malach project corpus. The Czech Malach corpus consists of lightly moderated dialogs (testimonies) with Holocaust survivors, originally recorded for the Shoah memory project by the Shoah Visual History Foundation. The dialogs usually start with shorter turns but continue as longer monologues by the survivors, often showing emotion and disfluencies caused by the interviewees recollecting distant memories.

The second portion of the corpus consists of dialogs that were recorded within the Companions project. The domain is reminiscing about personal photograph collections. The goal of this project was to create virtual companions that would be able to have a natural conversation with humans.

Layers of annotation

PDTSC 1.0 has three hierarchical layers and one external base layer (audio). The bottom layer of the corpus (z-layer) contains automatic speech recognition output aligned to the audio. It is a simplified token layer that is interlinked with the manual transcription through synchronization points. The second layer (w-layer) is a literal manual transcript, i.e., everything the speaker said, including all slips of the tongue, coughing, laughter, etc. The topmost layer (m-layer), called speech reconstruction, is an edited version of the literal transcript: disfluencies are removed and sentences are smoothed to meet written-text standards. There are many ways to produce correct written text from a literal transcript. To capture this fact, we provide multiple parallel annotations for each transcript (two or three different versions made by different annotators).
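
The relationship between the layers can be illustrated with a minimal Python sketch. It is not the corpus file format and not a tool distributed with PDTSC: the class names, field names and the English example utterance are all hypothetical, invented only to show how a z-layer token, a w-layer token and several parallel m-layer reconstructions could refer to one another.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ZToken:
    """Hypothetical z-layer token: ASR output aligned to the audio."""
    text: str
    start_sec: float
    end_sec: float


@dataclass
class WToken:
    """Hypothetical w-layer token: literal manual transcript."""
    text: str
    sync_sec: float  # assumed synchronization point into the audio


@dataclass
class MSentence:
    """Hypothetical m-layer sentence: one annotator's reconstruction."""
    annotator: str
    text: str
    w_indices: List[int]  # w-layer tokens this sentence was built from


def dropped_w_tokens(sentence: MSentence, w_layer: List[WToken]) -> List[str]:
    """Return w-layer tokens that the reconstruction left out."""
    kept = set(sentence.w_indices)
    return [tok.text for i, tok in enumerate(w_layer) if i not in kept]


def z_tokens_at(z_layer: List[ZToken], sync_sec: float) -> List[str]:
    """Return z-layer tokens whose audio span covers a synchronization point."""
    return [z.text for z in z_layer if z.start_sec <= sync_sec < z.end_sec]


# Invented example data, not taken from the corpus.
z_layer = [
    ZToken("well", 0.0, 0.4),
    ZToken("I", 0.4, 0.6),
    ZToken("was", 0.9, 1.2),
    ZToken("home", 1.6, 2.0),
]
w_layer = [
    WToken("well", 0.0), WToken("I", 0.4), WToken("I", 0.6),
    WToken("was", 0.9), WToken("uh", 1.2), WToken("at", 1.4),
    WToken("home", 1.6),
]
# Two parallel reconstructions of the same transcript by different annotators.
m_layer = [
    MSentence("annotator_A", "I was at home.", [1, 3, 5, 6]),
    MSentence("annotator_B", "Well, I was at home.", [0, 1, 3, 5, 6]),
]

for sentence in m_layer:
    print(sentence.annotator, "dropped:", dropped_w_tokens(sentence, w_layer))
```

In this sketch both reconstructions drop the repeated word and the filler, but only one of them keeps the discourse marker, which mirrors the reason for providing several parallel m-layer annotations per transcript.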