This small dataset contains 3 speech corpora collected using the Alex Translate telephone service (https://ufal.mff.cuni.cz/alex#alex-translate).
The "part1" and "part2" corpora contain English speech with transcriptions and Czech translations. These recordings were collected from users of the service. Part 1 contains earlier recordings, filtered to include only clean speech; Part 2 contains later recordings with no filtering applied.
The "cstest" corpus contains recordings of artificially created sentences, each containing one or more Czech names of places in the Czech Republic. These were recorded by a multinational group of students studying in Prague.
We present a test corpus of audio recordings and transcriptions of presentations of students' enterprises together with their slides and web-pages. The corpus is intended for evaluation of automatic speech recognition (ASR) systems, especially in conditions where the prior availability of in-domain vocabulary and named entities is benefitable.
The corpus consists of 39 presentations in English, each up to 90 seconds long, and slides and web-pages in Czech, Slovak, English, German, Romanian, Italian or Spanish.
The speakers are high school students from European countries with English as their second language.
We benchmark three baseline ASR systems on the corpus and show their imperfection.
Additional three Czech reference translations of the whole WMT 2011 data set (http://www.statmt.org/wmt11/test.tgz), translated from the German originals. Original segmentation of the WMT 2011 data is preserved. and This project has been sponsored by the grants GAČR P406/11/1499 and EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003+7E11051 of the Ministry of Education, Youth and Sports of the Czech Republic)
Data
-------
Bengali Visual Genome (BVG for short) 1.0 has similar goals as Hindi Visual Genome (HVG) 1.1: to support the Bengali language. Bengali Visual Genome 1.0 is the multi-modal dataset in Bengali for machine translation and image
captioning. Bengali Visual Genome is a multimodal dataset consisting of text and images suitable for English-to-Bengali multimodal machine translation tasks and multimodal research. We follow the same selection of short English segments (captions) and the associated images from Visual Genome as HGV 1.1 has. For BVG, we manually translated these captions from English to Bengali taking the associated images into account. The manual translation is performed by the native Bengali speakers without referring to any machine translation system.
The training set contains 29K segments. Further 1K and 1.6K segments are provided in development and test sets, respectively, which follow the same (random) sampling from the original Hindi Visual Genome. A third test set is
called the ``challenge test set'' and consists of 1.4K segments. The challenge test set was created for the WAT2019 multi-modal task by searching for (particularly) ambiguous English words based on the embedding similarity and
manually selecting those where the image helps to resolve the ambiguity. The surrounding words in the sentence however also often include sufficient cues to identify the correct meaning of the ambiguous word.
Dataset Formats
---------------
The multimodal dataset contains both text and images.
The text parts of the dataset (train and test sets) are in simple tab-delimited plain text files.
All the text files have seven columns as follows:
Column1 - image_id
Column2 - X
Column3 - Y
Column4 - Width
Column5 - Height
Column6 - English Text
Column7 - Bengali Text
The image part contains the full images with the corresponding image_id as the file name. The X, Y, Width and Height columns indicate the rectangular region in the image described by the caption.
Data Statistics
---------------
The statistics of the current release are given below.
Parallel Corpus Statistics
--------------------------
Dataset Segments English Words Bengali Words
---------- -------- ------------- -------------
Train 28930 143115 113978
Dev 998 4922 3936
Test 1595 7853 6408
Challenge Test 1400 8186 6657
---------- -------- ------------- -------------
Total 32923 164076 130979
The word counts are approximate, prior to tokenization.
Citation
--------
If you use this corpus, please cite the following paper:
@inproceedings{hindi-visual-genome:2022,
title= "{Bengali Visual Genome: A Multimodal Dataset for Machine Translation and Image Captioning}",
author={Sen, Arghyadeep
and Parida, Shantipriya
and Kotwal, Ketan
and Panda, Subhadarshi
and Bojar, Ond{\v{r}}ej
and Dash, Satya Ranjan},
editor={Satapathy, Suresh Chandra
and Peer, Peter
and Tang, Jinshan
and Bhateja, Vikrant
and Ghosh, Anumoy},
booktitle= {Intelligent Data Engineering and Analytics},
publisher= {Springer Nature Singapore},
address= {Singapore},
pages = {63--70},
isbn = {978-981-16-6624-7},
doi = {10.1007/978-981-16-6624-7_7},
}
COSTRA 1.0 is a dataset of Czech complex sentence transformations. The dataset is intended for the study of sentence-level embeddings beyond simple word alternations or standard paraphrasing.
The dataset consist of 4,262 unique sentences with average length of 10 words, illustrating 15 types of modifications such as simplification, generalization, or formal and informal language variation.
The hope is that with this dataset, we should be able to test semantic properties of sentence embeddings and perhaps even to find some topologically interesting “skeleton” in the sentence embedding space.
Costra 1.1 is a new dataset for testing geometric properties of sentence embeddings spaces. In particular, it concentrates on examining how well sentence embeddings capture complex phenomena such paraphrases, tense or generalization. The dataset is a direct expansion of Costra 1.0, which was extended with more sentences and sentence comparisons.
CsEnVi Pairwise Parallel Corpora consist of Vietnamese-Czech parallel corpus and Vietnamese-English parallel corpus. The corpora were assembled from the following sources:
- OPUS, the open parallel corpus is a growing multilingual corpus of translated open source documents.
The majority of Vi-En and Vi-Cs bitexts are subtitles from movies and television series.
The nature of the bitexts are paraphrasing of each other's meaning, rather than translations.
- TED talks, a collection of short talks on various topics, given primarily in English, transcribed and with transcripts translated to other languages. In our corpus, we use 1198 talks which had English and Vietnamese transcripts available and 784 talks which had Czech and Vietnamese transcripts available in January 2015.
The size of the original corpora collected from OPUS and TED talks is as follows:
CS/VI EN/VI
Sentence 1337199/1337199 2035624/2035624
Word 9128897/12073975 16638364/17565580
Unique word 224416/68237 91905/78333
We improve the quality of the corpora in two steps: normalizing and filtering.
In the normalizing step, the corpora are cleaned based on the general format of subtitles and transcripts. For instance, sequences of dots indicate explicit continuation of subtitles across multiple time frames. The sequences of dots are distributed differently in the source and the target side. Removing the sequence of dots, along with a number of other normalization rules, improves the quality of the alignment significantly.
In the filtering step, we adapt the CzEng filtering tool [1] to filter out bad sentence pairs.
The size of cleaned corpora as published is as follows:
CS/VI EN/VI
Sentence 1091058/1091058 1113177/1091058
Word 6718184/7646701 8518711/8140876
Unique word 195446/59737 69513/58286
The corpora are used as training data in [2].
References:
[1] Ondřej Bojar, Zdeněk Žabokrtský, et al. 2012. The Joy of Parallelism with CzEng 1.0. Proceedings of LREC2012. ELRA. Istanbul, Turkey.
[2] Duc Tam Hoang and Ondřej Bojar, The Prague Bulletin of Mathematical Linguistics. Volume 104, Issue 1, Pages 75–86, ISSN 1804-0462. 9/2015
CUBBITT En-Cs translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/).
Models are compatible with Tensor2tensor version 1.6.6.
For details about the model training (data, model hyper-parameters), please contact the archive maintainer.
Evaluation on newstest2014 (BLEU):
en->cs: 27.6
cs->en: 34.4
(Evaluated using multeval: https://github.com/jhclark/multeval)
CUBBITT En-Fr translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/).
Models are compatible with Tensor2tensor version 1.6.6.
For details about the model training (data, model hyper-parameters), please contact the archive maintainer.
Evaluation on newstest2014 (BLEU):
en->fr: 38.2
fr->en: 36.7
(Evaluated using multeval: https://github.com/jhclark/multeval)