ParCzech4Speech 1.0
Please use the following text to cite this item or export to a predefined format:
Stankov, Vladislav; Kopp, Matyáš and Bojar, Ondřej, 2025,
ParCzech4Speech 1.0, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11234/1-5946.
Authors
Stankov, Vladislav; Kopp, Matyáš; Bojar, Ondřej
Item identifier
http://hdl.handle.net/11234/1-5946
Date issued
2025-06-27
Size
2695 hours
Language(s)
Description
We introduce ParCzech4Speech 1.0, a processed version of the ParCzech 4.0 corpus targeted at speech modeling tasks. Its largest variant contains 2,695 hours of aligned speech from 587 speakers. We combined the sound recordings of Czech parliamentary speeches with their official transcripts, and processed the recordings with WhisperX and Wav2Vec 2.0 to obtain automatic audio-text alignments.
The dataset is offered in three variants:
(1) sentence-segmented, with clean sentence boundaries, for automatic speech recognition and speech synthesis tasks,
(2) unsegmented, preserving the original utterance flow across sentences, and
(3) raw alignment, for custom refinement toward other tasks.
Note: This release contains only alignment data and text segments (official and recognized transcripts). The source audio must be obtained separately from the AudioPSP 24.01 corpus: use the 'filePath' column to locate the corresponding audio file and the 'start'/'end' timestamps to extract specific segments.
The official transcripts are available in the ParCzech 4.0 corpus (http://hdl.handle.net/11234/1-5360).
The original audio files are available in the AudioPSP 24.01 corpus (http://hdl.handle.net/11234/1-5404).
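The Note above on locating audio through 'filePath' and cutting it by 'start'/'end' can be sketched as follows. This is a minimal illustration, not the authors' tooling: it assumes the timestamps are in seconds, that 'filePath' is relative to a local copy of AudioPSP 24.01, and that ffmpeg is installed. The function name `segment_command`, the `audio_root` directory, and the sample row are hypothetical; only the three column names come from this record.

```python
import csv
import io
from pathlib import Path


def segment_command(row, audio_root, out_path):
    """Build an ffmpeg command that cuts one aligned segment.

    Assumption: 'start' and 'end' are offsets in seconds from the
    beginning of the audio file named by 'filePath'.
    """
    start = float(row["start"])
    end = float(row["end"])
    if end <= start:
        raise ValueError("segment end must be greater than start")
    src = Path(audio_root) / row["filePath"]
    # -ss before -i seeks in the input; -t gives the segment duration.
    return [
        "ffmpeg",
        "-ss", f"{start:.3f}",
        "-i", str(src),
        "-t", f"{end - start:.3f}",
        str(out_path),
    ]


# Hypothetical two-line TSV; the released files may contain more columns.
sample = "filePath\tstart\tend\nexample/audio.mp3\t10.0\t12.5\n"
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
cmd = segment_command(rows[0], "AudioPSP", "segment_0.wav")
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the audio has been downloaded. Check the README.md of this item for the actual timestamp units before relying on the seconds assumption.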
Note: All three variants are provided in both .tsv (tab-separated values) and .parquet (columnar binary) formats. The data content is identical across formats.
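Because the .tsv files are plain tab-separated text, they can be streamed with the Python standard library without loading a whole file into memory. The sketch below sums the aligned duration of a table. It assumes that 'start' and 'end' are given in seconds and that fields are not quoted; the helper name `total_hours` and the sample rows are hypothetical.

```python
import csv
import io


def total_hours(tsv_file):
    """Sum (end - start) over all rows and convert seconds to hours."""
    # QUOTE_NONE is an assumption: transcript text may contain quote
    # characters that should not be interpreted as field delimiters.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    seconds = 0.0
    for row in reader:
        seconds += float(row["end"]) - float(row["start"])
    return seconds / 3600.0


# Hypothetical sample: two segments of 1800 s and 3600 s.
sample = "filePath\tstart\tend\nx.mp3\t0.0\t1800.0\nx.mp3\t1800.0\t5400.0\n"
hours = total_hours(io.StringIO(sample))
```

For a downloaded file, pass an open handle instead, for example `open("sentence_segmented.tsv", newline="", encoding="utf-8")`; UTF-8 is assumed here rather than stated in this record.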
Acknowledgement
NPO (EC NextGenEU RRF)
Project code: MPO 60273/24/21300/21000
Project name: CEDMO 2.0 NPO
MŠMT OP JAK Mezisektorová spolupráce
Project code: CZ.02.01.01/00/23_020/0008518
Project name: Jazykověda, umělá inteligence a jazykové a řečové technologie: od výzkumu k aplikacím
This item is Publicly Available
and licensed under:
Files in this item
- Name: unsegmented.tsv; Size: 637.48 MB; Format: application/octet-stream; Description: Unknown; MD5: 9014ce3cf737e5cbafb45ee1568adae4
- Name: unsegmented.parquet; Size: 297.92 MB; Format: application/octet-stream; Description: Unknown; MD5: bbe48dc7c33b2f3c42e40d94fd18f0a5
- Name: sentence_segmented.tsv; Size: 313.45 MB; Format: application/octet-stream; Description: Unknown; MD5: e95bd51bfbc9d9df02ec794b2e22a4d2
- Name: sentence_segmented.parquet; Size: 141.64 MB; Format: application/octet-stream; Description: Unknown; MD5: 99149576be04314b196727abcf709c60
- Name: raw_alignment.7z; Size: 682.35 MB; Format: application/octet-stream; Description: Unknown; MD5: d3254b00d93f1fa9da7d84afa88c65e3
- Name: raw_alignment_zstd.parquet; Size: 908.32 MB; Format: application/octet-stream; Description: Unknown; MD5: 691e39ad3610902d9c98600f01915f06
- Name: README.md; Size: 9.44 KB; Format: application/octet-stream; Description: Unknown; MD5: 59f2ca43cbde5fe4871df64e17e6b446
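The MD5 values listed for the files can be used to check a download for corruption. A minimal sketch using Python's `hashlib`; the helper name `md5_of_file` and the chunk size are illustrative choices, not part of this item.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size blocks so large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example (assumes README.md has been downloaded to the working directory):
# assert md5_of_file("README.md") == "59f2ca43cbde5fe4871df64e17e6b446"
```

A mismatch between the computed digest and the listed one indicates an incomplete or altered download. MD5 is suitable here as an integrity check only, not as a security guarantee.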

