
 
dc.contributor.author Hoang, Duc Tam
dc.contributor.author Bojar, Ondřej
dc.date.accessioned 2015-12-25T22:55:37Z
dc.date.available 2015-12-25T22:55:37Z
dc.date.issued 2015-11-10
dc.identifier.uri http://hdl.handle.net/11234/1-1595
dc.description CsEnVi Pairwise Parallel Corpora consist of a Vietnamese-Czech parallel corpus and a Vietnamese-English parallel corpus. The corpora were assembled from the following sources:

- OPUS, the open parallel corpus, is a growing multilingual corpus of translated open source documents. The majority of Vi-En and Vi-Cs bitexts are subtitles from movies and television series. The two sides of these bitexts tend to paraphrase each other's meaning rather than translate each other directly.
- TED talks, a collection of short talks on various topics, given primarily in English, transcribed and with transcripts translated to other languages. In our corpus, we use 1198 talks which had English and Vietnamese transcripts available and 784 talks which had Czech and Vietnamese transcripts available in January 2015.

The size of the original corpora collected from OPUS and TED talks is as follows:

              CS/VI               EN/VI
Sentence      1337199/1337199     2035624/2035624
Word          9128897/12073975    16638364/17565580
Unique word   224416/68237        91905/78333

We improve the quality of the corpora in two steps: normalizing and filtering.

In the normalizing step, the corpora are cleaned based on the general format of subtitles and transcripts. For instance, sequences of dots indicate explicit continuation of subtitles across multiple time frames. These sequences are distributed differently on the source and the target side. Removing them, along with applying a number of other normalization rules, improves the quality of the alignment significantly.

In the filtering step, we adapt the CzEng filtering tool [1] to filter out bad sentence pairs.

The size of the cleaned corpora as published is as follows:

              CS/VI               EN/VI
Sentence      1091058/1091058     1113177/1091058
Word          6718184/7646701     8518711/8140876
Unique word   195446/59737        69513/58286

The corpora are used as training data in [2].

References:
[1] Ondřej Bojar, Zdeněk Žabokrtský, et al. 2012. The Joy of Parallelism with CzEng 1.0. Proceedings of LREC2012. ELRA. Istanbul, Turkey.
[2] Duc Tam Hoang and Ondřej Bojar, The Prague Bulletin of Mathematical Linguistics. Volume 104, Issue 1, Pages 75–86, ISSN 1804-0462. 9/2015
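
The dot-removal rule mentioned in the normalizing step could look roughly like the sketch below. This is only an illustration under stated assumptions, not the authors' actual normalization code: the function names `strip_continuation_dots` and `normalize_pair` are hypothetical, and treating any run of two or more dots (or the single ellipsis character) as a continuation marker is an assumption about what "sequences of dots" means here.

```python
import re
from typing import Tuple

# Assumption: a "sequence of dots" is two or more consecutive periods,
# or the single-character ellipsis. A lone sentence-final period is kept.
_DOT_RUN = re.compile(r"(?:\.{2,}|…)")
_WHITESPACE_RUN = re.compile(r"\s+")


def strip_continuation_dots(line: str) -> str:
    """Remove runs of dots and collapse the whitespace left behind."""
    without_dots = _DOT_RUN.sub(" ", line)
    return _WHITESPACE_RUN.sub(" ", without_dots).strip()


def normalize_pair(source: str, target: str) -> Tuple[str, str]:
    """Apply the same (hypothetical) rule to both sides of a sentence pair."""
    return strip_continuation_dots(source), strip_continuation_dots(target)
```

Applying the same rule to both sides removes the mismatch in how the dots are distributed between source and target, which is the effect the description attributes to this step.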
dc.language.iso ces
dc.language.iso eng
dc.language.iso vie
dc.publisher Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.relation info:eu-repo/grantAgreement/EC/H2020/645452
dc.rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.subject corpus
dc.subject Vietnamese
dc.subject parallel corpus
dc.subject Czech-Vietnamese corpus
dc.subject English-Vietnamese corpus
dc.title CsEnVi Pairwise Parallel Corpora
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
dc.rights.label PUB
has.files yes
branding LINDAT / CLARIAH-CZ
contact.person Duc Tam Hoang hoangdt@comp.nus.edu.sg Charles University in Prague, UFAL
sponsor European Union H2020-ICT-2014-1-645452 QT21: Quality Translation 21 euFunds info:eu-repo/grantAgreement/EC/H2020/645452
files.size 1001980095
files.count 4


 Files in this item (total download size: 955.56 MB)

Name                Size        Format    Description   MD5
original-csvi.tmx   232.66 MB   Unknown   Unknown       65d399e60a882a886ec99fb0ef721c4e
prepared-csvi.tmx   183.41 MB   Unknown   Unknown       274637b1c521c5780bc87176313e89c1
prepared-envi.tmx   190.91 MB   Unknown   Unknown       0a388811a353b869bed01ea34a8ab008
original-envi.tmx   348.58 MB   Unknown   Unknown       cb85ebafeff92048a848f996e8f09bfe
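
The MD5 values listed for the files can be used to check a download for corruption. The following is a minimal sketch, assuming the four files were saved under their listed names in a single directory; the helper names `md5_of_file` and `verify_downloads` are hypothetical and not part of the repository.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of this item.
EXPECTED_MD5 = {
    "original-csvi.tmx": "65d399e60a882a886ec99fb0ef721c4e",
    "prepared-csvi.tmx": "274637b1c521c5780bc87176313e89c1",
    "prepared-envi.tmx": "0a388811a353b869bed01ea34a8ab008",
    "original-envi.tmx": "cb85ebafeff92048a848f996e8f09bfe",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory):
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results
```

Calling `verify_downloads(".")` in the download directory returns a dictionary in which any `False` entry marks a file that is missing or should be downloaded again.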
