We present a test corpus of audio recordings and transcriptions of presentations of students' enterprises, together with the accompanying slides and web pages. The corpus is intended for evaluation of automatic speech recognition (ASR) systems, especially in conditions where prior availability of in-domain vocabulary and named entities is beneficial.
The corpus consists of 39 presentations in English, each up to 90 seconds long, with slides and web pages in Czech, Slovak, English, German, Romanian, Italian, or Spanish.
The speakers are high school students from European countries with English as their second language.
We benchmark three baseline ASR systems on the corpus and demonstrate their limitations.
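As a hedged illustration of how such a benchmark might be scored, the following sketch computes word error rate (WER) with the jiwer library; the file names and the plain-text transcript layout are assumptions for illustration, not part of the corpus specification.

    # Minimal WER-scoring sketch (assumed file layout, not the corpus's official tooling).
    import jiwer

    # Hypothetical paths: one reference transcript and one ASR output per presentation.
    with open("reference/presentation_01.txt", encoding="utf-8") as f:
        reference = f.read()
    with open("asr_output/presentation_01.txt", encoding="utf-8") as f:
        hypothesis = f.read()

    # jiwer splits on whitespace by default; WER = (S + D + I) / reference word count.
    print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")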
COSTRA 1.0 is a dataset of Czech complex sentence transformations. The dataset is intended for the study of sentence-level embeddings beyond simple word alternations or standard paraphrasing.
The dataset consists of 4,262 unique sentences with an average length of 10 words, illustrating 15 types of modifications such as simplification, generalization, or formal and informal language variation.
We hope that this dataset will make it possible to test semantic properties of sentence embeddings and perhaps even to find some topologically interesting “skeleton” in the sentence embedding space.
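A hedged sketch of the kind of probing the dataset is meant to support: checking whether one transformation type corresponds to a consistent direction in embedding space. The embed function below is a deterministic stand-in so the sketch runs, and the sentence pairs are placeholders; neither is part of COSTRA.

    # Probing sketch: does a transformation type correspond to a consistent
    # direction in embedding space? Replace `embed` with a real model.
    import hashlib
    import numpy as np

    def embed(sentence: str) -> np.ndarray:
        # Stand-in only: deterministic pseudo-random vector so the sketch runs.
        seed = int(hashlib.md5(sentence.encode("utf-8")).hexdigest(), 16) % (2**32)
        return np.random.default_rng(seed).standard_normal(128)

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Offset vectors for the same transformation applied to different seed
    # sentences; high similarity would suggest a consistent direction.
    pairs = [("seed sentence 1", "its formal variant"),
             ("seed sentence 2", "its formal variant")]
    offsets = [embed(t) - embed(s) for s, t in pairs]
    print("offset similarity:", cosine(offsets[0], offsets[1]))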
Costra 1.1 is a new dataset for testing geometric properties of sentence embedding spaces. In particular, it concentrates on examining how well sentence embeddings capture complex phenomena such as paraphrases, tense, or generalization. The dataset is a direct expansion of Costra 1.0, extended with more sentences and sentence comparisons.
CUBBITT En-Cs translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/).
Models are compatible with Tensor2Tensor version 1.6.6.
For details about the model training (data, model hyper-parameters), please contact the archive maintainer.
Evaluation on newstest2014 (BLEU):
en->cs: 27.6
cs->en: 34.4
(Evaluated using multeval: https://github.com/jhclark/multeval)
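A hedged sketch of querying an exported model through TensorFlow Serving's REST API. The host, port, and model name are assumptions, and Tensor2Tensor exports typically expect subword-encoded inputs wrapped in serialized tf.Example protos rather than raw text, so the payload below is illustrative only; consult the export's signature before adapting it.

    # Illustrative TensorFlow Serving REST query (assumed host, port, model name).
    import requests

    SERVER = "http://localhost:8501"   # assumption: TF Serving running locally
    MODEL = "encs"                     # assumption: model name chosen at export

    resp = requests.post(
        f"{SERVER}/v1/models/{MODEL}:predict",
        json={"instances": [{"input": "Hello world."}]},  # illustrative payload
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json())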
CUBBITT En-Fr translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/).
Models are compatible with Tensor2Tensor version 1.6.6.
For details about the model training (data, model hyper-parameters), please contact the archive maintainer.
Evaluation on newstest2014 (BLEU):
en->fr: 38.2
fr->en: 36.7
(Evaluated using multeval: https://github.com/jhclark/multeval)
CUBBITT En-Pl translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/).
Models are compatible with Tensor2Tensor version 1.6.6.
For details about the model training (data, model hyper-parameters), please contact the archive maintainer.
Evaluation on newstest2020 (BLEU):
en->pl: 12.3
pl->en: 20.0
(Evaluated using multeval: https://github.com/jhclark/multeval)
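The scores above come from multeval; as a hedged cross-check, the sketch below computes corpus-level BLEU with sacrebleu instead. Tokenization differs between the two tools, so the numbers need not match exactly; the file paths are placeholders.

    # BLEU cross-check with sacrebleu (a different tool than the multeval above).
    import sacrebleu

    with open("hypotheses.txt", encoding="utf-8") as f:   # placeholder path
        hyps = [line.rstrip("\n") for line in f]
    with open("references.txt", encoding="utf-8") as f:   # placeholder path
        refs = [line.rstrip("\n") for line in f]

    # corpus_bleu takes a list of hypotheses and a list of reference streams.
    print(f"BLEU = {sacrebleu.corpus_bleu(hyps, [refs]).score:.1f}")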
ELITR Minuting Corpus consists of transcripts of meetings in Czech and English, their manually created summaries ("minutes") and manual alignments between the two.
Czech meetings are in the computer science and public administration domains and English meetings are in the computer science domain.
Each transcript has one or multiple corresponding minutes files. Alignments are only provided for a portion of the data.
This corpus contains 59 Czech and 120 English meeting transcripts, consisting of 71,097 and 87,322 dialogue turns, respectively. For Czech meetings, we provide 147 minutes in total, 55 of them aligned; for English meetings, 256 minutes in total, 111 of them aligned.
Please find a more detailed description of the data in the included README and stats.tsv files.
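A hedged sketch for a first look at the bundled stats.tsv; since the column layout is not restated here, the code only prints the header and the row count rather than assuming particular fields.

    # Inspect the bundled stats.tsv without assuming its column names.
    import csv

    with open("stats.tsv", encoding="utf-8", newline="") as f:
        rows = list(csv.reader(f, delimiter="\t"))

    header, data = rows[0], rows[1:]
    print("columns:", header)
    print("rows:   ", len(data))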
If you use this corpus, please cite:
Nedoluzhko, A., Singh, M., Hledíková, M., Ghosal, T., and Bojar, O.
(2022). ELITR Minuting Corpus: A novel dataset for automatic minuting
from multi-party meetings in English and Czech. In Proceedings of the
13th International Conference on Language Resources and Evaluation
(LREC-2022), Marseille, France, June. European Language Resources
Association (ELRA). In print.
@inproceedings{elitr-minuting-corpus:2022,
author = {Anna Nedoluzhko and Muskaan Singh and Marie
Hled{\'{\i}}kov{\'{a}} and Tirthankar Ghosal and Ond{\v{r}}ej Bojar},
title = {{ELITR} {M}inuting {C}orpus: {A} Novel Dataset for
Automatic Minuting from Multi-Party Meetings in {E}nglish and {C}zech},
booktitle = {Proceedings of the 13th International Conference
on Language Resources and Evaluation (LREC-2022)},
year = 2022,
month = {June},
address = {Marseille, France},
publisher = {European Language Resources Association (ELRA)},
note = {In print.}
}
Eyetracked Multi-Modal Translation (EMMT) is a simultaneous eye-tracking, 4-electrode EEG, and audio corpus for multi-modal reading and translation scenarios. It contains monocular eye movement recordings, audio data, and 4-electrode wearable electroencephalogram (EEG) data of 43 participants engaged in sight translation supported by an image.
The details about the experiment and the dataset can be found in the README file.
ESIC (Europarl Simultaneous Interpreting Corpus) is a corpus of 370 speeches (10 hours) in English, with manual transcripts, transcribed simultaneous interpreting into Czech and German, and parallel translations.
The corpus contains the source English videos and audios. The interpreters' voices are not published within the corpus, but a tool is provided that downloads them from the European Parliament website, where they are publicly available.
The transcripts are punctuated, annotated with metadata (disfluencies, mixed voices and languages, read or spontaneous speech, etc.), and include word-level timestamps.
The speeches in the corpus come from European Parliament plenary sessions in the period 2008-2011. Most of the speakers are MEPs, both native and non-native speakers of English. The corpus contains metadata about the speakers (name, surname, id, fraction) and about the speech (date, topic, read or spontaneous).
The current version of ESIC is v1.0. It has validation and evaluation parts.
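As a hedged illustration of what the word-level timestamps enable, the sketch below estimates the interpreter's onset delay (ear-voice span) at the start of a speech. The word/start/end tab-separated file format and the file names are assumptions, not ESIC's documented layout.

    # Ear-voice-span sketch over word-level timestamps.
    # ASSUMED format: one "word<TAB>start_sec<TAB>end_sec" line per word;
    # check ESIC's actual file layout before reusing this.
    def load_timestamps(path):
        words = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                word, start, end = line.rstrip("\n").split("\t")
                words.append((word, float(start), float(end)))
        return words

    source = load_timestamps("speech.en.tsv")        # placeholder path
    interp = load_timestamps("interpreting.cs.tsv")  # placeholder path

    # Crude estimate: interpreting onset minus source speech onset.
    print(f"onset delay: {interp[0][1] - source[0][1]:.2f} s")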
ESIC (Europarl Simultaneous Interpreting Corpus) is a corpus of 370 speeches (10 hours) in English, with manual transcripts, transcribed simultaneous interpreting into Czech and German, and parallel translations.
The corpus contains the source English videos and audios. The interpreters' voices are not published within the corpus, but a tool is provided that downloads them from the European Parliament website, where they are publicly available.
The transcripts are punctuated, annotated with metadata (disfluencies, mixed voices and languages, read or spontaneous speech, etc.), and include word-level timestamps.
The speeches in the corpus come from European Parliament plenary sessions in the period 2008-2011. Most of the speakers are MEPs, both native and non-native speakers of English. The corpus contains metadata about the speakers (name, surname, id, fraction) and about the speech (date, topic, read or spontaneous).
ESIC has validation and evaluation parts.
The current version is ESIC v1.1. It extends v1.0 with manual sentence alignment of the tri-parallel texts and with bi-parallel sentence alignment of the English original transcripts and the German interpreting.