ParCzech4Speech 1.0

Please use the following text to cite this item or export to a predefined format:
Stankov, Vladislav; Kopp, Matyáš and Bojar, Ondřej, 2025, ParCzech4Speech 1.0, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11234/1-5946.
Date issued
2025-06-27
Size
2695 hours
Language(s)
Description
We introduce ParCzech4Speech 1.0, a processed version of the ParCzech 4.0 corpus targeted at speech modeling tasks. The largest variant contains 2,695 hours of aligned speech from 587 speakers. We combined the sound recordings of Czech parliamentary speeches with the official transcripts and processed the recordings with WhisperX and Wav2Vec 2.0 to obtain automatic audio-text alignment.

The dataset is offered in three variants: (1) sentence-segmented, with clean boundaries, for automatic speech recognition and speech synthesis tasks; (2) unsegmented, preserving the original utterance flow across sentences; and (3) raw alignment, for further custom refinement for other possible tasks.

Note: This release contains alignment data and text segments (official and recognized transcripts). The source audio must be obtained separately from the AudioPSP 24.01 corpus (http://hdl.handle.net/11234/1-5404), using the 'filePath' column to locate the corresponding audio file and the 'start'/'end' timestamps to extract specific segments. The official transcripts are available in the ParCzech 4.0 corpus (http://hdl.handle.net/11234/1-5360).

Note: All three variants are provided in both .tsv (tab-separated values) and .parquet (columnar binary) formats. The data content is identical across formats.
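As a rough illustration of how the 'filePath', 'start' and 'end' columns might be used to locate a segment, the sketch below reads rows from a tab-separated file and converts the timestamps into sample offsets. It assumes the timestamps are expressed in seconds and that the audio has already been decoded at a known sample rate; neither is stated in this record, so check the README before relying on it. The sample row and the helper names are made up for the example.

```python
import csv
import io


def read_segments(tsv_file):
    """Yield (file_path, start, end) for every row of a TSV variant.

    Only the three columns named in the description are used; any other
    columns present in the file are ignored.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row["filePath"], float(row["start"]), float(row["end"])


def to_sample_range(start_s, end_s, sample_rate):
    """Convert timestamps to a [first, last) range of sample indices.

    Assumption: start_s and end_s are in seconds.
    """
    return round(start_s * sample_rate), round(end_s * sample_rate)


# Made-up row standing in for one of the .tsv files (hypothetical path).
sample = "filePath\tstart\tend\nsession/a.mp3\t1.5\t4.0\n"
segments = list(read_segments(io.StringIO(sample)))
```

With the audio of a file decoded into an array at, say, 16 kHz, the slice given by `to_sample_range(start, end, 16000)` would then cover the aligned segment. For the real data, pass an open file handle (for example `open("sentence_segmented.tsv", encoding="utf-8", newline="")`) instead of the in-memory sample.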
Acknowledgement
This item is Publicly Available
and licensed under:
 Files in this item
Name
unsegmented.tsv
Size
637.48 MB
Format
application/octet-stream
Description
Unknown
MD5
9014ce3cf737e5cbafb45ee1568adae4
Name
unsegmented.parquet
Size
297.92 MB
Format
application/octet-stream
Description
Unknown
MD5
bbe48dc7c33b2f3c42e40d94fd18f0a5
Name
sentence_segmented.tsv
Size
313.45 MB
Format
application/octet-stream
Description
Unknown
MD5
e95bd51bfbc9d9df02ec794b2e22a4d2
Name
sentence_segmented.parquet
Size
141.64 MB
Format
application/octet-stream
Description
Unknown
MD5
99149576be04314b196727abcf709c60
Name
raw_alignment.7z
Size
682.35 MB
Format
application/octet-stream
Description
Unknown
MD5
d3254b00d93f1fa9da7d84afa88c65e3
Name
raw_alignment_zstd.parquet
Size
908.32 MB
Format
application/octet-stream
Description
Unknown
MD5
691e39ad3610902d9c98600f01915f06
Name
README.md
Size
9.44 KB
Format
application/octet-stream
Description
Unknown
MD5
59f2ca43cbde5fe4871df64e17e6b446
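The MD5 values listed above can be used to check a download for corruption. The following is a minimal sketch using only the Python standard library; the checksums are copied from the file list in this record, while the function name and the chunk size are arbitrary choices made for the example.

```python
import hashlib

# Checksums as listed in "Files in this item".
EXPECTED_MD5 = {
    "unsegmented.tsv": "9014ce3cf737e5cbafb45ee1568adae4",
    "unsegmented.parquet": "bbe48dc7c33b2f3c42e40d94fd18f0a5",
    "sentence_segmented.tsv": "e95bd51bfbc9d9df02ec794b2e22a4d2",
    "sentence_segmented.parquet": "99149576be04314b196727abcf709c60",
    "raw_alignment.7z": "d3254b00d93f1fa9da7d84afa88c65e3",
    "raw_alignment_zstd.parquet": "691e39ad3610902d9c98600f01915f06",
    "README.md": "59f2ca43cbde5fe4871df64e17e6b446",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A downloaded file is intact if, for example, `md5_of_file("README.md") == EXPECTED_MD5["README.md"]` evaluates to true. MD5 is adequate here for detecting transfer errors, not as a security guarantee.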