InCroMin 1.0: Corpus of Cross-lingual Dialogues with Minutes and Detection of Misunderstandings
Please use the following text to cite this item:
Marko Čechovič, Natália Komorníková, Dominik Macháček, Ondřej Bojar, 2025,
InCroMin 1.0: Corpus of Cross-lingual Dialogues with Minutes and Detection of Misunderstandings, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11234/1-5956.
Date issued
2025-07-08
Size
5 hours,
14 entries
Description
This data package contains published parts of InCroMin, a corpus of
cross-lingual dialogues with minutes and detection of misunderstandings.
InCroMin is described in the paper **Corpus of Cross-lingual Dialogues with Minutes
and Detection of Misunderstandings** by Marko Čechovič, Natália Komorníková,
Dominik Macháček, and Ondřej Bojar, to be published at TSD 2025.
The data were created by volunteer participants, 2-5 people per meeting. They
were matched so that each meeting included at least two groups of people who did
not understand each other's language. Each meeting was facilitated by a
simultaneous speech translation tool integrated in Minuteman. The meetings were
held via a teleconferencing platform that recorded each speaker in a separate
audio track. The participants consented to data processing and release.
Their speech was then automatically transcribed in its original language and
automatically translated into English. Human annotators then manually corrected
the transcripts and translations, and deidentified the audio and texts by removing
confidential information such as personal names. The annotators also created minutes.
The InCroMin corpus is intended primarily for evaluating automatic systems that
aim to facilitate cross-lingual dialogues in realistic conditions, end to end.
It can be used to evaluate Automatic Speech Processing, Speech Translation,
Simultaneous Speech Translation, Quality Estimation, and Automatic Minuting.
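As an illustration of the kind of evaluation mentioned above, the following minimal sketch computes a word error rate (WER) between a reference text and a system output, for example a manually corrected transcript versus its automatic counterpart. It is not part of the corpus tooling. The internal format of the `.tt.txt` files is not described here, so the hypothetical function `word_error_rate` works on plain strings and assumes whitespace tokenization, which would not suit languages such as Chinese without prior segmentation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: both inputs are plain text that can be split on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    # Toy strings, not taken from the corpus.
    print(word_error_rate("the meeting starts at nine",
                          "the meeting start at nine"))
```

The same function could be applied per speaker track once the transcript text has been extracted from the files.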
Acknowledgement
NPO (EC NextGenEU RRF)
Project code: MPO 60273/24/21300/21000
Project name: CEDMO 2.0 NPO
MŠMT OP JAK Mezisektorová spolupráce
Project code: CZ.02.01.01/00/23_020/0008518
Project name: Jazykověda, umělá inteligence a jazykové a řečové technologie: od výzkumu k aplikacím
UK UNCE
Project code: UNCE24/SSH/009
Project name: Výzkum velkých textových korpusů prizmatem vícejazyčnosti a komplementárních metodologických přístupů
This item is Publicly Available
and licensed under:
Files in this item
- Name: incromin-1.0.zip
- Size: 177.13 MB
- Format: application/zip
- MD5: 604e9031015bde065f65b2de5e5efef3
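
The MD5 checksum listed above can be used to check a downloaded copy of the archive. A minimal sketch follows, assuming the file was saved as `incromin-1.0.zip` in the current directory; the helper names `md5_of_file` and `verify_archive` are illustrative only.

```python
import hashlib
from pathlib import Path

# Checksum listed for incromin-1.0.zip on this page.
EXPECTED_MD5 = "604e9031015bde065f65b2de5e5efef3"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` has the expected MD5 digest."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("incromin-1.0.zip")  # assumed local download location
    if archive.exists():
        print("checksum OK" if verify_archive(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```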

- meetings
  - ru_zh_1
    - A_ru-B_zh_corrected.txt (11 kB)
    - A_ru-B_zh.mp3 (5 MB)
    - minutes.txt (718 B)
  - fr_sk_1
    - B_sk.tt.txt (6 kB)
    - B_sk.mp3 (9 MB)
    - B_en_corrected.tt.txt (6 kB)
    - A_en_corrected.tt.txt (6 kB)
    - minutes.txt (716 B)
    - B_sk_audiodeident.tt.txt (28 B)
    - A_fr_corrected.tt.txt (7 kB)
    - A_fr.tt.txt (6 kB)
    - A_fr.mp3 (9 MB)
    - B_sk_corrected.tt.txt (6 kB)
  - cs_ru_3
    - A_ru.mp3 (5 MB)
    - A_ru.tt.txt (13 kB)
    - B_en_corrected.tt.txt (12 kB)
    - A_ru_corrected.tt.txt (13 kB)
    - A_en_corrected.tt.txt (8 kB)
    - minutes.txt (760 B)
    - B_en.tt.txt (12 kB)
    - B_cs.mp3 (5 MB)
    - B_cs_corrected.tt.txt (13 kB)
    - B_cs.tt.txt (13 kB)
  - cs_ru_2
    - B_cs.tt.txt (22 kB)
    - B_en_corrected.tt.txt (20 kB)
    - A_ru.mp3 (4 MB)
    - B_cs.mp3 (4 MB)
    - B_cs_corrected.tt.txt (21 kB)
    - minutes.txt (1 kB)
    - A_en.tt.txt (3 kB)
    - A_ru.tt.txt (5 kB)
  - cs_ru_1
    - A_ru.mp3 (7 MB)
    - A_ru.tt.txt (15 kB)
    - B_en_corrected.tt.txt (17 kB)
    - A_en_corrected.tt.txt (13 kB)
    - minutes.txt (5 kB)
    - B_cs_audiodeident.tt.txt (138 B)
    - B_cs.mp3 (7 MB)
    - A_ru_audiodeident.tt.txt (439 B)
    - B_cs_corrected.tt.txt (17 kB)
    - B_cs.tt.txt (14 kB)
  - cs_zh_1
    - A_zh.tt.txt (4 kB)
    - B_en_corrected.tt.txt (15 kB)
    - minutes.txt (2 kB)
    - B_en.tt.txt (15 kB)
    - B_cs_audiodeident.tt.txt (253 B)
    - B_cs.mp3 (5 MB)
    - B_cs_corrected.tt.txt (16 kB)
    - A_zh.mp3 (5 MB)
    - B_cs.tt.txt (16 kB)
    - A_en.tt.txt (4 kB)
    - A_zh_audiodeident.tt.txt (85 B)
  - cs_pt-BR_1
    - A_cs.tt.txt (10 kB)
    - B_en_corrected.tt.txt (10 kB)
    - B_pt-BR.mp3 (5 MB)
    - A_en_corrected.tt.txt (9 kB)
    - minutes.txt (4 kB)
    - B_pt-BR2pt-BR.tt.txt (12 kB)
    - B_pt-BR_corrected.tt.txt (12 kB)
    - A_en.tt.txt (10 kB)
    - A_cs.mp3 (5 MB)
    - A_cs_corrected.tt.txt (10 kB)
  - cs_it_1
    - B_en_corrected.tt.txt (9 kB)
    - A_en_corrected.tt.txt (7 kB)
    - A_it.mp3 (5 MB)
    - A_it.tt.txt (7 kB)
    - B_cs_audiodeident.tt.txt (163 B)
    - B_cs.mp3 (5 MB)
    - B_cs_corrected.tt.txt (9 kB)
    - A_it_corrected.tt.txt (7 kB)
    - B_cs.tt.txt (9 kB)
    - A_it_audiodeident.tt.txt (569 B)
  - cs_cs_es_pt_sk_1
    - C_cs.mp3 (3 MB)
    - E_cs_audiodeident.tt.txt (56 B)
    - A_pt.tt.txt (2 kB)
    - D_es.mp3 (3 MB)
    - A_pt.mp3 (3 MB)
    - A_pt_audiodeident.tt.txt (27 B)
    - C_en.tt.txt (1 kB)
    - B_sk.tt.txt (2 kB)
    - A_en.tt.txt (2 kB)
    - C_cs_corrected.tt.txt (1 kB)
    - D_es.tt.txt (1 kB)
    - D_en.tt.txt (1 kB)
    - E_cs.mp3 (3 MB)
    - E_cs_corrected.tt.txt (5 kB)
    - B_en.tt.txt (1 kB)
    - E_en.tt.txt (3 kB)
    - E_cs.tt.txt (3 kB)
    - C_cs_audiodeident.tt.txt (27 B)
    - C_cs.tt.txt (1 kB)
  - cs_cs_zh_1
    - A_cs.tt.txt (5 kB)
    - C_en_corrected.tt.txt (10 kB)
    - C_zh_audiodeident.tt.txt (774 B)
    - C_en.tt.txt (10 kB)
    - C_zh.tt.txt (9 kB)
    - C_zh.mp3 (5 MB)
    - B_en.tt.txt (3 kB)
    - A_cs_audiodeident.tt.txt (380 B)
    - B_cs.mp3 (5 MB)
    - B_cs_audiodeident.tt.txt (385 B)
    - B_cs_corrected.tt.txt (4 kB)
    - B_cs.tt.txt (4 kB)
    - A_en.tt.txt (4 kB)
    - A_cs.mp3 (5 MB)
    - A_cs_corrected.tt.txt (5 kB)
    - C_zh_corrected.tt.txt (8 kB)
  - cs_hy_1
    - A_hy.tt.txt (5 kB)
    - B_en_corrected.tt.txt (9 kB)
    - minutes.txt (1 kB)
    - B_en.tt.txt (12 kB)
    - B_cs.mp3 (8 MB)
    - B_cs_corrected.tt.txt (10 kB)
    - B_cs.tt.txt (10 kB)
    - A_en.tt.txt (4 kB)
    - A_hy.mp3 (8 MB)
  - uk_vi_1
    - B_vi_corrected.tt.txt (6 kB)
    - B_en_corrected.tt.txt (4 kB)
    - B_vi.tt.txt (6 kB)
    - B_vi.mp3 (4 MB)
    - A_en_corrected.tt.txt (6 kB)
    - A_uk.mp3 (4 MB)
    - minutes.txt (2 kB)
    - A_uk.tt.txt (8 kB)
  - cs_mr_1
    - B_mr.mp3 (11 MB)
    - B_en.tt.txt (5 kB)
    - A_en.orig.tt.txt (6 kB)
    - A_cs_audiodeident.tt.txt (101 B)
    - A_cs.mp3 (5 MB)
    - B_mr.tt.txt (8 kB)
    - A_cs.tt.txt (9 kB)
    - A_en.tt.txt (6 kB)
  - ru_zh_2
    - A_ru-B_zh_corrected.txt (12 kB)
    - A_ru-B_zh.mp3 (17 MB)
    - A_ru-B_zh.diarization_corrected.tt.txt (3 kB)
    - minutes.txt (2 kB)
- README.md (6 kB)
- metadata.ods (38 kB)
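
Most per-speaker file names in the listing above appear to encode a speaker letter, a language code, and an optional `corrected` or `audiodeident` suffix. The sketch below parses names of that shape. The pattern is inferred from the listing only and is an assumption, not a documented specification; the README.md in the archive is the authoritative description. Names that deviate from the pattern, such as `A_ru-B_zh_corrected.txt` or `B_pt-BR2pt-BR.tt.txt`, are deliberately left unparsed.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from names such as "B_cs_corrected.tt.txt" or "A_ru.mp3".
# Assumption: one uppercase speaker letter, a two-letter language code with an
# optional region (e.g. "pt-BR"), and an optional suffix.
FILE_PATTERN = re.compile(
    r"^(?P<speaker>[A-Z])_(?P<lang>[A-Za-z]{2}(?:-[A-Z]{2})?)"
    r"(?:_(?P<variant>corrected|audiodeident))?"
    r"\.(?P<ext>tt\.txt|mp3)$"
)


class TrackFile(NamedTuple):
    speaker: str            # e.g. "A"
    lang: str               # e.g. "cs" or "pt-BR"
    variant: Optional[str]  # "corrected", "audiodeident", or None
    kind: str               # "audio" or "transcript"


def parse_track_name(name: str) -> Optional[TrackFile]:
    """Parse a per-speaker file name; return None if it does not fit the pattern."""
    match = FILE_PATTERN.match(name)
    if match is None:
        return None
    kind = "audio" if match["ext"] == "mp3" else "transcript"
    return TrackFile(match["speaker"], match["lang"], match["variant"], kind)


if __name__ == "__main__":
    for name in ["B_cs_corrected.tt.txt", "B_pt-BR.mp3", "minutes.txt"]:
        print(name, "->", parse_track_name(name))
```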

