Search Results
2. Bengali Visual Genome 1.0
- Creator:
- Sen, Arghyadeep, Parida, Shantipriya, Kotwal, Ketan, Panda, Subhadarshi, Bojar, Ondřej, and Dash, Satya Ranjan
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- image and corpus
- Subject:
- multi-modal, neural machine translation, image captioning, Bengali captioning, and English-Bengali Multimodal Corpus
- Language:
- English and Bengali
- Description:
- Data
-------
Bengali Visual Genome (BVG for short) 1.0 has similar goals as Hindi Visual Genome (HVG) 1.1: to support the Bengali language. Bengali Visual Genome 1.0 is a multimodal dataset consisting of text and images suitable for English-to-Bengali multimodal machine translation, image captioning, and multimodal research. We follow the same selection of short English segments (captions) and the associated images from Visual Genome as HVG 1.1. For BVG, we manually translated these captions from English to Bengali, taking the associated images into account. The manual translation was performed by native Bengali speakers without referring to any machine translation system.

The training set contains 29K segments. A further 1K and 1.6K segments are provided in the development and test sets, respectively, which follow the same (random) sampling as the original Hindi Visual Genome. A third test set, called the ``challenge test set'', consists of 1.4K segments. The challenge test set was created for the WAT2019 multi-modal task by searching for (particularly) ambiguous English words based on embedding similarity and manually selecting those where the image helps to resolve the ambiguity. The surrounding words in the sentence, however, also often include sufficient cues to identify the correct meaning of the ambiguous word.

Dataset Formats
---------------
The multimodal dataset contains both text and images. The text parts of the dataset (train and test sets) are simple tab-delimited plain text files. All the text files have seven columns as follows:
Column1 - image_id
Column2 - X
Column3 - Y
Column4 - Width
Column5 - Height
Column6 - English Text
Column7 - Bengali Text
The image part contains the full images with the corresponding image_id as the file name. The X, Y, Width and Height columns indicate the rectangular region in the image described by the caption.
Data Statistics
---------------
The statistics of the current release are given below.

Parallel Corpus Statistics
--------------------------
Dataset          Segments  English Words  Bengali Words
---------------  --------  -------------  -------------
Train               28930         143115         113978
Dev                   998           4922           3936
Test                 1595           7853           6408
Challenge Test       1400           8186           6657
---------------  --------  -------------  -------------
Total               32923         164076         130979

The word counts are approximate, prior to tokenization.

Citation
--------
If you use this corpus, please cite the following paper:

@inproceedings{hindi-visual-genome:2022,
  title     = "{Bengali Visual Genome: A Multimodal Dataset for Machine Translation and Image Captioning}",
  author    = {Sen, Arghyadeep and Parida, Shantipriya and Kotwal, Ketan and Panda, Subhadarshi and Bojar, Ond{\v{r}}ej and Dash, Satya Ranjan},
  editor    = {Satapathy, Suresh Chandra and Peer, Peter and Tang, Jinshan and Bhateja, Vikrant and Ghosh, Anumoy},
  booktitle = {Intelligent Data Engineering and Analytics},
  publisher = {Springer Nature Singapore},
  address   = {Singapore},
  pages     = {63--70},
  isbn      = {978-981-16-6624-7},
  doi       = {10.1007/978-981-16-6624-7_7},
}
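The seven-column tab-delimited format described above can be parsed with a few lines of Python. This is a minimal sketch, not part of the release: the `Segment` record type and the `read_segments` helper are illustrative names, and the defensive skip of malformed rows is an assumption about how a consumer might want to handle them.

```python
import csv
from typing import List, NamedTuple

# One caption record; field names mirror the seven documented columns.
class Segment(NamedTuple):
    image_id: str
    x: int
    y: int
    width: int
    height: int
    english: str
    target: str  # Bengali (or Hausa/Hindi/Malayalam in the sibling datasets)

def read_segments(path: str) -> List[Segment]:
    """Parse a tab-delimited Visual Genome-style text file into Segment records."""
    segments = []
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) != 7:
                continue  # skip malformed lines defensively
            segments.append(Segment(row[0], int(row[1]), int(row[2]),
                                    int(row[3]), int(row[4]), row[5], row[6]))
    return segments
```

The same reader applies to the Hausa, Hindi, and Malayalam Visual Genome releases below, since they share the column layout and differ only in the language of the seventh column.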
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
3. Eye-Tracking Recordings from a Pilot Study of WMT-style MT Outputs Ranking
- Creator:
- Bojar, Ondřej, Děchtěrenko, Filip, and Zelenina, Maria
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- image and corpus
- Subject:
- eye-tracking and MT evaluation
- Language:
- Czech and English
- Description:
- This package contains the eye-tracker recordings of 8 subjects evaluating English-to-Czech machine translation quality using the WMT-style ranking of sentences. We provide the set of sentences evaluated, the exact screens presented to the annotators (including bounding box information for every area of interest and even for individual letters in the text) and, finally, the raw EyeLink II files with gaze trajectories.

The description of the experiment can be found in the paper: Ondřej Bojar, Filip Děchtěrenko, Maria Zelenina. A Pilot Eye-Tracking Study of WMT-Style Ranking Evaluation. Proceedings of the LREC 2016 Workshop “Translation Evaluation – From Fragmented Tools and Data Sets to an Integrated Ecosystem”, Georg Rehm, Aljoscha Burchardt et al. (eds.), pp. 20-26. May 2016, Portorož, Slovenia.

This work has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no. 645452 (QT21). This work was partially financially supported by the Government of the Russian Federation, Grant 074-U01. This work has been using language resources developed, stored and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2010013).
- Rights:
- Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB
4. GrandStaff-LMX: Linearized MusicXML Encoding of the GrandStaff Dataset
- Creator:
- Mayer, Jiří, Straka, Milan, Hajič jr., Jan, and Pecina, Pavel
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- image and corpus
- Subject:
- GrandStaff, pianoform scores, MusicXML, and Linearized MusicXML
- Language:
- No linguistic content
- Description:
- The GrandStaff-LMX dataset is based on the GrandStaff dataset described in the paper "End-to-end optical music recognition for pianoform sheet music" by Antonio Ríos-Vila et al., 2023, https://doi.org/10.1007/s10032-023-00432-z . The GrandStaff-LMX dataset contains MusicXML and Linearized MusicXML encodings of all systems from the original dataset, suitable for evaluation with the TEDn metric. It also contains the official GrandStaff train/dev/test split.
- Rights:
- Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB
5. HaCzech: Dataset of Handwritten Czech
- Creator:
- Procházka, Štěpán and Straka, Milan
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- image and corpus
- Subject:
- htr, ocr, manuscripts, chronicles, and handwriting
- Language:
- Czech
- Description:
- A dataset of handwritten Czech text lines, sourced from two chronicles (municipal chronicles, 1931-1944; school chronicles, 1913-1933). The dataset comprises 25k lines machine-extracted from scanned pages and provides manual annotation of the text content for a 2k-line subset.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
6. Hausa Visual Genome 1.0
- Creator:
- Abdulmumin, Idris, Dash, Satya Ranjan, Dawud, Musa Abdullahi, Parida, Shantipriya, Muhammad, Shamsuddeen Hassan, Ahmad, Ibrahim Sa'id, Panda, Subhadarshi, Bojar, Ondřej, Galadanci, Bashir Shehu, and Bello, Bello Shehu
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- image and corpus
- Subject:
- multi-modal, machine translation, image captioning, image annotation, and neural machine translation
- Language:
- Hausa and English
- Description:
- Data
-------
Hausa Visual Genome 1.0 is a multimodal dataset consisting of text and images suitable for English-to-Hausa multimodal machine translation tasks and multimodal research. We follow the same selection of short English segments (captions) and the associated images from Visual Genome as the Hindi Visual Genome 1.1 dataset. We automatically translated the English captions to Hausa and manually post-edited them, taking the associated images into account.

The training set contains 29K segments. A further 1K and 1.6K segments are provided in the development and test sets, respectively, which follow the same (random) sampling as the original Hindi Visual Genome. Additionally, a challenge test set of 1400 segments is available for the multi-modal task. This challenge test set was created in Hindi Visual Genome by searching for (particularly) ambiguous English words based on embedding similarity and manually selecting those where the image helps to resolve the ambiguity.

Dataset Formats
---------------
The multimodal dataset contains both text and images. The text parts of the dataset (train and test sets) are simple tab-delimited plain text files. All the text files have seven columns as follows:
Column1 - image_id
Column2 - X
Column3 - Y
Column4 - Width
Column5 - Height
Column6 - English Text
Column7 - Hausa Text
The image part contains the full images with the corresponding image_id as the file name. The X, Y, Width, and Height columns indicate the rectangular region in the image described by the caption.

Data Statistics
---------------
The statistics of the current release are given below.
Parallel Corpus Statistics
--------------------------
Dataset          Segments  English Words  Hausa Words
---------------  --------  -------------  -----------
Train               28930         143106       140981
Dev                   998           4922         4857
Test                 1595           7853         7736
Challenge Test       1400           8186         8752
---------------  --------  -------------  -----------
Total               32923         164067       162326

The word counts are approximate, prior to tokenization.

Citation
--------
If you use this corpus, please cite the following paper:

@InProceedings{abdulmumin-EtAl:2022:LREC,
  author    = {Abdulmumin, Idris and Dash, Satya Ranjan and Dawud, Musa Abdullahi and Parida, Shantipriya and Muhammad, Shamsuddeen and Ahmad, Ibrahim Sa'id and Panda, Subhadarshi and Bojar, Ond{\v{r}}ej and Galadanci, Bashir Shehu and Bello, Bello Shehu},
  title     = "{Hausa Visual Genome: A Dataset for Multi-Modal English to Hausa Machine Translation}",
  booktitle = {Proceedings of the Language Resources and Evaluation Conference},
  month     = {June},
  year      = {2022},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {6471--6479},
  url       = {https://aclanthology.org/2022.lrec-1.694}
}
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
7. Hindi Visual Genome 1.0
- Creator:
- Parida, Shantipriya and Bojar, Ondřej
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- image and corpus
- Subject:
- parallel corpus, corpus, multilingual, machine translation, shared task, English-Hindi parallel corpus, image captioning, and multi-modal
- Language:
- English and Hindi
- Description:
- Data
----
Hindi Visual Genome 1.0 is a multimodal dataset consisting of text and images suitable for the English-to-Hindi multimodal machine translation task and multimodal research. We have selected short English segments (captions) from Visual Genome along with the associated images and automatically translated them to Hindi with manual post-editing, taking the associated images into account.

The training set contains 29K segments. A further 1K and 1.6K segments are provided in the development and test sets, respectively, which follow the same (random) sampling from the original Hindi Visual Genome. Additionally, a challenge test set of 1400 segments will be released for the WAT2019 multi-modal task. This challenge test set was created by searching for (particularly) ambiguous English words based on embedding similarity and manually selecting those where the image helps to resolve the ambiguity.

Dataset Formats
---------------
The multimodal dataset contains both text and images. The text parts of the dataset (train and test sets) are simple tab-delimited plain text files. All the text files have seven columns as follows:
Column1 - image_id
Column2 - X
Column3 - Y
Column4 - Width
Column5 - Height
Column6 - English Text
Column7 - Hindi Text
The image part contains the full images with the corresponding image_id as the file name. The X, Y, Width and Height columns indicate the rectangular region in the image described by the caption.

Data Statistics
---------------
The statistics of the current release are given below.

Parallel Corpus Statistics
--------------------------
Dataset          Segments  English Words  Hindi Words
---------------  --------  -------------  -----------
Train               28932         143178       136722
Dev                   998           4922         4695
Test                 1595           7852         7535
Challenge Test       1400           8185         8665  (Released separately)
---------------  --------  -------------  -----------
Total               32925         164137       157617

The word counts are approximate, prior to tokenization.
Citation
--------
If you use this corpus, please cite the following paper:

@article{hindi-visual-genome:2019,
  title   = {{Hindi Visual Genome: A Dataset for Multimodal English-to-Hindi Machine Translation}},
  author  = {Parida, Shantipriya and Bojar, Ond{\v{r}}ej and Dash, Satya Ranjan},
  journal = {Computaci{\'o}n y Sistemas},
  note    = {In print. Presented at CICLing 2019, La Rochelle, France},
  year    = {2019},
}
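The X, Y, Width and Height columns locate the captioned region as pixel offsets from the image's top-left corner. A minimal sketch of extracting that region, using a nested list of pixel rows as a stand-in for a decoded image (a real pipeline would use an image library such as Pillow; the function name is illustrative):

```python
def crop_region(pixels, x, y, width, height):
    """Return the rectangular sub-region described by a caption.

    `pixels` is a list of rows (top to bottom), each a list of pixel
    values; (x, y) is the top-left corner of the region.
    """
    return [row[x:x + width] for row in pixels[y:y + height]]
```

With Pillow, the equivalent call would be `image.crop((x, y, x + width, y + height))`, since the crop box is given as left, upper, right, lower coordinates.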
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
8. Malayalam Visual Genome 1.0
- Creator:
- Parida, Shantipriya and Bojar, Ondřej
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- image and corpus
- Subject:
- multi-modal, neural machine translation, English-Malayalam Multimodal Corpus, Malayalam Image Captioning, and English-Malayalam Multimodal Translation
- Language:
- English and Malayalam
- Description:
- Data
-------
Malayalam Visual Genome (MVG for short) 1.0 has similar goals as Hindi Visual Genome (HVG) 1.1: to support the Malayalam language. Malayalam Visual Genome 1.0 is the first multimodal dataset in Malayalam for machine translation and image captioning, and it serves the WAT 2021 Multi-Modal Machine Translation Task. It consists of text and images suitable for the English-to-Malayalam multimodal machine translation task and multimodal research. We follow the same selection of short English segments (captions) and the associated images from Visual Genome as HVG 1.1. For MVG, we automatically translated these captions from English to Malayalam and manually corrected them, taking the associated images into account.

The training set contains 29K segments. A further 1K and 1.6K segments are provided in the development and test sets, respectively, which follow the same (random) sampling as the original Hindi Visual Genome. A third test set, called the ``challenge test set'', consists of 1.4K segments. The challenge test set was created for the WAT2019 multi-modal task by searching for (particularly) ambiguous English words based on embedding similarity and manually selecting those where the image helps to resolve the ambiguity. The surrounding words in the sentence, however, also often include sufficient cues to identify the correct meaning of the ambiguous word. For MVG, we simply translated the English side of the test sets to Malayalam, again utilizing machine translation to speed up the process.

Dataset Formats
---------------
The multimodal dataset contains both text and images. The text parts of the dataset (train and test sets) are simple tab-delimited plain text files.
All the text files have seven columns as follows:
Column1 - image_id
Column2 - X
Column3 - Y
Column4 - Width
Column5 - Height
Column6 - English Text
Column7 - Malayalam Text
The image part contains the full images with the corresponding image_id as the file name. The X, Y, Width and Height columns indicate the rectangular region in the image described by the caption.

Data Statistics
---------------
The statistics of the current release are given below.

Parallel Corpus Statistics
--------------------------
Dataset          Segments  English Words  Malayalam Words
---------------  --------  -------------  ---------------
Train               28930         143112           107126
Dev                   998           4922             3619
Test                 1595           7853             5689
Challenge Test       1400           8186             6044
---------------  --------  -------------  ---------------
Total               32923         164073           122478

The word counts are approximate, prior to tokenization.

Citation
--------
If you use this corpus, please cite the following paper:

@article{hindi-visual-genome:2019,
  title   = {{Hindi Visual Genome: A Dataset for Multimodal English-to-Hindi Machine Translation}},
  author  = {Parida, Shantipriya and Bojar, Ond{\v{r}}ej and Dash, Satya Ranjan},
  journal = {Computaci{\'o}n y Sistemas},
  volume  = {23},
  number  = {4},
  pages   = {1499--1505},
  year    = {2019}
}
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
9. MUSCIMA++
- Creator:
- Hajič jr., Jan
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- image and corpus
- Subject:
- Optical Music Recognition, Music Notation, Graph-Based Representation, and Symbol Detection
- Language:
- No linguistic content
- Description:
- MUSCIMA++ is a dataset of handwritten music notation for musical symbol detection. It contains 91255 symbols, consisting of both notation primitives and higher-level notation objects, such as key signatures or time signatures. There are 23352 notes in the dataset, of which 21356 have a full notehead, 1648 have an empty notehead, and 348 are grace notes. For each annotated object in an image, we provide both the bounding box and a pixel mask that defines exactly which pixels within the bounding box belong to the given object. Composite constructions, such as notes, are captured through explicitly annotated relationships of the notation primitives (noteheads, stems, beams...). This way, the annotation provides an explicit bridge between the low-level and high-level symbols described in the Optical Music Recognition literature.

MUSCIMA++ has annotations for 140 images from the CVC-MUSCIMA dataset [2], used for handwritten music notation writer identification and staff removal. CVC-MUSCIMA consists of 1000 binary images: 20 pages of music were each re-written by 50 musicians, binarized, and staves were removed. We had 7 different annotators marking musical symbols: each annotator marked one version of each of the 20 CVC-MUSCIMA pages, with the writers selected so that the 140 images cover 2-3 images from each of the 50 CVC-MUSCIMA writers. This setup ensures maximal variability of handwriting, given the limitations in annotation resources.

The MUSCIMA++ dataset is intended for musical symbol detection and classification, and for music notation reconstruction. A thorough description of its design is published on arXiv [2]: https://arxiv.org/abs/1703.04824 The full definition of the ground truth is given in the form of annotator instructions.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
10. OLiMPiC 1.0: OpenScore Lieder Linearized MusicXML Piano Corpus
- Creator:
- Mayer, Jiří, Straka, Milan, Hajič jr., Jan, and Pecina, Pavel
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- image and corpus
- Subject:
- OpenScore Lieder, pianoform scores, MusicXML, and Linearized MusicXML
- Language:
- No linguistic content
- Description:
- OLiMPiC: OpenScore Lieder Linearized MusicXML Piano Corpus is a dataset containing synthetic and scanned images of pianoform music scores. The scores and the scanned images originate from the OpenScore Lieder Corpus https://github.com/OpenScore/Lieder . OLiMPiC contains the scores in MusicXML and Linearized MusicXML encoding, suitable for evaluation with the TEDn metric. The official train/dev/test split is also provided.
- Rights:
- Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB