# ParCzech 3.0 walkthrough

## Source data format (HTML and audio)

All HTML files encountered during data gathering are stored in `parczech-3.0-html`, with each file's path derived from its absolute URL. Audio files are stored in archives named `parczech-3.0-audio-TERM-YEAR`, where `TERM` is the election term of the Chamber of Deputies (`ps7` or `ps8`) and `YEAR` is the year of recording.

- parczech-3.0-audio-ps7-2013
- parczech-3.0-audio-ps7-2014
- parczech-3.0-audio-ps7-2015
- parczech-3.0-audio-ps7-2016
- parczech-3.0-audio-ps7-2017
- parczech-3.0-audio-ps8-2017
- parczech-3.0-audio-ps8-2018
- parczech-3.0-audio-ps8-2019
- parczech-3.0-audio-ps8-2020
- parczech-3.0-audio-ps8-2021 (only the beginning of the year 2021)

## Parla-CLARIN TEI format

### Schema

`parczech-3.0-tei.schema` contains the RNG files that define the structure of the XML files. All TEI files validate against the Parla-CLARIN schema. Since the ParCzech schema is much stricter than the Parla-CLARIN schema, we include it as well.

### Raw TEI version

`parczech-3.0-tei` contains the teiCorpus and TEI files in their raw version. It has the same metadata as the annotated version, but no linguistic annotation or audio alignment.

### Annotated TEI version

The annotated version contains teiCorpus and TEI files that are linguistically annotated and aligned:

**UDPipe 2**: tokenization, morphology and syntax

> Milan Straka (2018): [UDPipe 2.0 Prototype at CoNLL 2018 UD Shared Task](https://www.aclweb.org/anthology/K18-2020/). In: Proceedings of CoNLL 2018: The SIGNLL Conference on Computational Natural Language Learning, pp. 197-207, Association for Computational Linguistics, Stroudsburg, PA, USA, ISBN 978-1-948087-72-8.

**NameTag 2**: CNEC 2.0 named entities

> Straková, Jana (2021): *NameTag 2*, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11234/1-3633.

**Audio alignment**: the original audio files are used.

## ASR format

### Intro

To understand how the data are organized, it helps to have an overview of how they were processed. The source material consists of MP3 files, each about 14 minutes long, together with stenographic transcripts. The transcripts are obviously not a 100% precise representation of what was said in the audio; they differ from it in the way written language differs from spoken language. To get timings for every word in the transcript, we used a speech recognition tool that outputs the recognized words from the audio with their starting and ending times; this also lets us compute the edit distance between each original word and the recognized one. The recognized transcriptions with timings allow us to align the original (stenographic) transcriptions to the audio. Once the alignment is done, each transcript can be divided into segments, e.g. at sentence boundaries, and the corresponding audio segment can be cut out for each of them. To judge whether a given segment is good or not (without listening to it), we compute a number of statistics for each segment. The alignment may fail, depending on the tool used for time extraction, or when the original transcript does not match the recording, so the data need to be cleaned of bad alignments. We provide our own cleaning scheme, but for completeness the data that did not survive the filtering are also kept. Moreover, the provided statistics allow anyone to create a custom filtering.

Side note 1: a *word*, in the context of this corpus, is any non-empty sequence of characters except sequences consisting only of punctuation.
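To make the alignment step concrete, here is a minimal Python sketch of aligning a stenographic word sequence to a recognized word sequence with word-level edit-distance dynamic programming. This is only an illustration of the idea, not the actual tool used for ParCzech; the function name and example words are made up.

```python
def align(orig_words, rec_words):
    """Align two word sequences with word-level Levenshtein dynamic
    programming; returns (orig_or_None, rec_or_None) pairs, where
    None marks a gap on that side."""
    n, m = len(orig_words), len(rec_words)
    # cost[i][j] = cheapest alignment of the first i original and j recognized words
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        cost[i][0] = i
    for j in range(m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = orig_words[i - 1].lower() == rec_words[j - 1].lower()
            cost[i][j] = min(cost[i - 1][j - 1] + (not same),  # match / substitution
                             cost[i - 1][j] + 1,               # original word vs. gap
                             cost[i][j - 1] + 1)               # gap vs. recognized word
    # backtrack to recover one optimal pairing
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and orig_words[i - 1].lower() == rec_words[j - 1].lower()
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (not same):
            pairs.append((orig_words[i - 1], rec_words[j - 1])); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((orig_words[i - 1], None)); i -= 1    # missed word
        else:
            pairs.append((None, rec_words[j - 1])); j -= 1
    return pairs[::-1]

print(align(["Vážené", "paní", "poslankyně"], ["VÁŽENÉ", "POSLANKYNĚ"]))
# [('Vážené', 'VÁŽENÉ'), ('paní', None), ('poslankyně', 'POSLANKYNĚ')]
```

Once each stenographic word is paired with a recognized word (or a gap), the recognized words' timings carry over to the stenographic side, which is what makes the segmentation and audio cutting possible.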
The corpus can be viewed as a set of segments, each placed under a folder determined by the source audio. For example, if an audio file `2018011711181132.mp3` was split into 69 segments, then the folder `2018011711181132` will contain folders `00`, `01`, `02`, up to `68` (if all segments survived the cleaning). Each audio folder (which stores the segment subfolders) also has its own statistics, computed on the level of the whole stenographic transcript, since, as already mentioned, it may happen that the entire transcript does not match the audio (e.g. due to an error during scraping).

Audio naming follows the pattern below; take `2018011711181132.mp3` as an example:

* 2018 01 17 - year, month and day when the audio was recorded
* 11:18 11:32 - starting and ending time of the recording

After bad segments are removed, the folder `2018011711181132` may not contain all segments. Its content can look like this:

`01` `02` `03` `05` `09` `13` `14` `15` `18` `19` `21` `22` `23` `28` `34` `38` `39` `40` `41` `47` `50` `57` `58` `60` `65` `68` `70` `stats.tsv`

So we see that some segments did not pass the filtering phase, and that the folder also holds the file with global statistics, `stats.tsv`.

Side note 2: *normalized edit distance* means the Levenshtein distance normalized by `max(len(orig_word), len(recognized_word))`. When computing edit distances we always ignore the cases where an empty string in the stenographic transcript is aligned to some recognized word; this holds for all versions of the edit distance in the statistics.

### Global statistics

The global statistics for the whole stenographic transcription, stored for example in `2018011711181132/stats.tsv`, are the following:

* `missed_percentage` - percentage of missed words
* `continuous_gaps_cnt` - number of continuous gaps: if a sequence of original words is aligned to a run of more than one gap, the whole run is counted as a single gap.
* `continuous_gaps_cnt_normalized1` - the previous statistic normalized by (number of words + number of continuous gaps).
* `continuous_gaps_cnt_normalized2` - `continuous_gaps_cnt` normalized by the number of words in the stenographic transcript.
* `median_normalized_dist` - median of the normalized edit distances. One important note: here the edit distance is computed only on words longer than 2 characters, because using shorter words gives over-optimistic results. Another note: here we ignore the cases where a (non-empty) original word is aligned to a gap; this variant of the edit distance is said to be *without gaps*.
* `normalized_dist_{60, 70, 75, 80, 90}` - Nth percentile of the normalized edit distance (without gaps).
* `median_normalized_dist_with_gaps` - the variant of `median_normalized_dist` where gaps are taken into account (a non-empty stenographic word aligned to an empty string).
* `normalized_dist_with_gaps_{60, 70, 75, 80, 90}` - Nth percentile of the normalized edit distance with gaps.

Side note 3: empirically, `continuous_gaps_cnt_normalized1` works well for detecting misalignments on the level of the whole file. The Nth percentile of the normalized edit distance (say it is 0.4) tells us that N percent of the original transcription has a normalized edit distance of at most 0.4.
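As an illustration of these definitions, here is a minimal Python sketch computing the normalized edit distance of side note 2 and the with/without-gaps variants; the function names and example word pairs are ours, for illustration only:

```python
import numpy as np

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match / substitute
        prev = cur
    return prev[-1]

def normalized_dist(orig: str, rec: str) -> float:
    # side note 2: normalize by the length of the longer word
    return levenshtein(orig, rec) / max(len(orig), len(rec))

# (original word, recognized word) pairs; "" marks an original word aligned to a gap
pairs = [("poslanecké", "poslanecká"), ("sněmovny", "sněmovny"), ("dnes", "")]

dists = [normalized_dist(o, r) for o, r in pairs if r != ""]   # "without gaps"
dists_gaps = [normalized_dist(o, r) for o, r in pairs]         # "with gaps": a gap scores 1.0

# note: the global median_normalized_dist additionally drops words of length <= 2
print(np.median(dists))               # cf. median_normalized_dist
print(np.percentile(dists_gaps, 75))  # cf. normalized_dist_with_gaps_75
```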
### Segment description

Diving into a segment folder, say `2018011711181132/01`, we will see the following files:

* `2018011711181132.asr` - all files with the extension `.asr` contain the text of the segment in upper case, without punctuation.
* `2018011711181132.prt` - all files with the extension `.prt` (alias *pretty*) contain the tokenized text of the segment as it appeared in the stenographic transcription (with all punctuation and upper/lower casing).
* `2018011711181132.wav` - the audio of the segment.
* `2018011711181132.speakers` - all files with the extension `.speakers` list the IDs of the speakers appearing in the segment.
* `2018011711181132.words` - a tab-separated file (see the parsing sketch after the statistics list below) with the columns:
    - `word` - word from the stenographic transcript
    - `word id` - ID of the word; this can be used to look up syntactic/morphological/... details about the word in the TEI version of the corpus
    - `start time` - can be -1 if the word was not recognized
    - `ending time` - can be -1
    - `average char duration of the word` - explained under the segment-level statistics below; see `avg_char_duration`
    - `speaker` - who said the word
* `stats.tsv` - statistics of the segment

### Segment statistics

Statistics for each segment (for example `2018011711181132/01/stats.tsv`):

* `words_cnt` - number of words
* `chars_cnt` - sum of the lengths of the words
* `duration` - duration of the segment in seconds
* `speakers_cnt` - number of speakers
* `missed_words` - number of missed words
* `missed_words_percentage` - `missed_words` normalized by the number of words
* `missed_chars` - sum of the lengths of the missed words
* `missed_chars_percentage` - `missed_chars` normalized by the sum of the lengths of the words
* `recognized_sound_coverage` - percentage of the audio length covered by original words aligned to non-empty strings
* `correct_end` - whether the segment has a correct ending time (this is an issue only for the last segment of an audio)
* `avg_char_duration` - average of X_i, where X_i = duration(word_i) / len(word_i)
* `std_char_duration` - standard deviation of the above statistic
* `median_char_duration` - as `avg_char_duration`, but computing the median instead of the average
* `char_duration_{60, 70, 75, 80, 90}` - Nth percentile variant of `median_char_duration`
* `avg_norm_word_dist` - average of the normalized edit distances (without gaps, i.e. ignoring cases where an original word is aligned to an empty string) over the words of the segment, as discussed in the global statistics; here, however, words of any length are taken into account
* `std_norm_word_dist` - standard deviation of the normalized edit distance (without gaps)
* `median_norm_word_dist` - median of the normalized edit distance (without gaps)
* `char_norm_word_dist_{60, 70, 75, 80, 90}` - Nth percentile of the normalized edit distance (without gaps); normalization is done by the maximum of the two lengths
* `avg_norm_word_dist_with_gaps` - as `avg_norm_word_dist`, but gaps are not ignored
* `std_norm_word_dist_with_gaps` - as `std_norm_word_dist`, but gaps are not ignored
* `median_norm_word_dist_with_gaps` - as `median_norm_word_dist`, but gaps are not ignored
* `char_norm_word_dist_with_gaps_{60, 70, 75, 80, 90}` - as `char_norm_word_dist_N`, but gaps are not ignored
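As a usage sketch, the following Python snippet parses a `.words` file and recomputes two of the per-segment statistics. The helper name and the assumption that `.words` files carry no header row are ours, for illustration:

```python
import csv

def read_words(path):
    """Parse a .words file with the columns listed above:
    word, word id, start time, ending time, average char duration, speaker."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for word, wid, start, end, char_dur, speaker in csv.reader(f, delimiter="\t"):
            rows.append({
                "word": word,
                "id": wid,
                "start": float(start),         # -1 if the word was not recognized
                "end": float(end),             # -1 if the word was not recognized
                "char_duration": float(char_dur),
                "speaker": speaker,
            })
    return rows

words = read_words("2018011711181132/01/2018011711181132.words")
recognized = [w for w in words if w["start"] >= 0]

# missed_words_percentage: missed words normalized by the number of words
missed_words_percentage = 1 - len(recognized) / len(words)

# avg_char_duration: average of X_i = duration(word_i) / len(word_i)
x = [(w["end"] - w["start"]) / len(w["word"]) for w in recognized]
avg_char_duration = sum(x) / len(x)
```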
### Corpus organization

The file with information about speakers, `parczech-3.0-speakers.tsv`, is a tab-separated file with the columns: id, surname, forename, gender, birth. Be careful: it contains missing values.

Finally, all segments were divided into mutually disjoint sets: (common) train, speakers dev, speakers test, context dev, context test, segments dev, segments test, and other (segments that did not pass the filtering). Each set is simply a folder with segments, grouped by the source audio. For a better understanding, here is an example of the train set (in a tree format):

* parczech-3.0-asr-train
    * 2013112513581412
        * 00
            * 2013112513581412.asr
            * 2013112513581412.speakers
            * 2013112513581412.words
            * 2013112513581412.prt
            * 2013112513581412.wav
            * stats.tsv
        * 04
        * 07
        * ...
        * 26
        * 28
        * stats.tsv
    * 2013112514081422
    * 2013112514181432
    * 2013112514281442
    * ...
* parczech-3.0-asr-other
* parczech-3.0-asr-context.dev
* parczech-3.0-asr-context.test
* parczech-3.0-asr-segments.dev
* parczech-3.0-asr-segments.test
* parczech-3.0-asr-speakers.dev
* parczech-3.0-asr-speakers.test
* parczech-3.0-speakers.tsv (file with information about speakers)

Since the division into sets was done on the segment level, segments coming from one audio may end up in different sets.

### Division into train, test, dev sets

The dev and test sets were created for three different purposes:

* Speakers Dev and Test were extracted from the clean data first, taking all utterances of a few speakers. They are thus useful in experiments where you want to assess system performance on new speakers. The proportion of men and women in these sets is artificially balanced, oversampling women compared to the corpus average.
* Context Dev and Test were formed in a way that preserves the partitioning into the original audio recordings: a few audio recordings were taken out of the clean data and all their segments were put into the context dev or test set. This way, the context of each utterance is available and discourse phenomena can be studied up to the level of the original division into files (and subject to filtering).
* Segments Dev and Test were created from the rest of the filtered data by sampling random segments.

### Statistics

#### Filtering stats

| Statistics                           | Original data  | Filtered data  |
|:-------------------------------------|---------------:|---------------:|
| Hours                                | 3071.57        | 1332.38        |
| Segments                             | 1391785        | 606540         |
| Average segment duration in seconds  | 7.94 +- 11.53  | 7.90 +- 7.14   |
| Average number of words in a segment | 15.91 +- 16.32 | 16.72 +- 13.73 |
| Words                                | 22153778       | 10146591       |
| Aligned words percentage             | 89.6%          | 96.3%          |
| Unique speakers                      | 475            | 474            |
| Segment size range in words          | [1, 1058]      | [2, 138]       |
| Duration range in seconds            | [0.0, 720.76]  | [0.82, 53.99]  |

#### Train, dev, test stats

| Set name    | Segments | Documents | Hours | Avg. dur. in sec. | Words   | Avg. # words | Unique speakers |
|:------------|---------:|----------:|------:|:------------------|--------:|:-------------|----------------:|
| Train       | 579169   | 19931     | 1271  | 7.9 +- 7.1        | 9679268 | 16.7 +- 13.7 | 417             |
| Speak. Dev  | 4596     | 744       | 10.7  | 8.4 +- 7.1        | 81708   | 17.8 +- 13.4 | 30              |
| Speak. Test | 4261     | 689       | 10.6  | 9.0 +- 7.0        | 78277   | 18.4 +- 12.9 | 30              |
| Cont. Dev   | 4556     | 149       | 10    | 7.9 +- 7.0        | 76512   | 16.8 +- 13.5 | 186             |
| Cont. Test  | 4868     | 149       | 10    | 7.4 +- 6.8        | 78360   | 16.1 +- 13.5 | 186             |
| Segm. Dev   | 4575     | 4020      | 10    | 7.9 +- 7.3        | 76243   | 16.7 +- 14.0 | 301             |
| Segm. Test  | 4515     | 3986      | 10    | 8.0 +- 7.2        | 76223   | 16.9 +- 13.8 | 291             |

## Reference

> Matyáš Kopp, Vladislav Stankov, Jan Oldřich Krůza, Pavel Straňák and Ondřej Bojar (2021): _ParCzech 3.0: A Large Czech Speech Corpus with Rich Metadata._