The CsEnVi Pairwise Parallel Corpora consist of a Vietnamese-Czech parallel corpus and a Vietnamese-English parallel corpus. The corpora were assembled from the following sources:
- OPUS, the open parallel corpus, a growing multilingual corpus of translated open-source documents.
The majority of the Vi-En and Vi-Cs bitexts are subtitles from movies and television series.
By nature, the two sides of these bitexts are paraphrases of each other's meaning rather than translations.
- TED talks, a collection of short talks on various topics, given primarily in English, transcribed, and with the transcripts translated into other languages. In our corpus, we use 1198 talks that had English and Vietnamese transcripts available and 784 talks that had Czech and Vietnamese transcripts available in January 2015.
The size of the original corpora collected from OPUS and TED talks is as follows:
              CS/VI               EN/VI
Sentence      1337199/1337199     2035624/2035624
Word          9128897/12073975    16638364/17565580
Unique word   224416/68237        91905/78333
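Counts of this kind (sentences, words, unique words per language side) can be computed with a few lines of code. The sketch below is only an illustration: the record does not say how words were tokenized, so whitespace splitting and case-sensitive matching are assumptions, and the function name `corpus_stats` is hypothetical.

```python
from typing import Dict, Iterable


def corpus_stats(sentences: Iterable[str]) -> Dict[str, int]:
    """Count sentences, whitespace-separated tokens, and distinct tokens.

    Assumption: one sentence per item, tokens separated by whitespace,
    and tokens compared case-sensitively.
    """
    n_sentences = 0
    n_words = 0
    vocabulary = set()
    for sentence in sentences:
        tokens = sentence.split()
        n_sentences += 1
        n_words += len(tokens)
        vocabulary.update(tokens)
    return {
        "sentences": n_sentences,
        "words": n_words,
        "unique_words": len(vocabulary),
    }


# Toy example for one side of a bitext (not data from the corpus).
vi_side = ["tôi yêu bạn", "tôi đi học"]
print(corpus_stats(vi_side))
```

Running the function once per language side gives the two numbers reported in each cell of the table, for example CS/VI.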
We improve the quality of the corpora in two steps: normalizing and filtering.
In the normalizing step, the corpora are cleaned based on the general format of subtitles and transcripts. For instance, sequences of dots indicate the explicit continuation of a subtitle across multiple time frames. These sequences are distributed differently on the source and the target side, so removing them, together with applying a number of other normalization rules, significantly improves the quality of the alignment.
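The dot-removal rule can be illustrated with a small sketch. It is not the normalization code used for the corpora: the exact rules are not given in this record, so the pattern (two or more dots, or the single ellipsis character) and the function name `strip_dot_runs` are assumptions.

```python
import re

# Assumption: a "sequence of dots" is two or more ASCII dots or the
# single-character ellipsis, optionally surrounded by whitespace.
DOT_RUN = re.compile(r"\s*(?:\.{2,}|…)\s*")


def strip_dot_runs(line: str) -> str:
    """Replace each run of dots with one space, then tidy whitespace."""
    cleaned = DOT_RUN.sub(" ", line)
    return re.sub(r"\s+", " ", cleaned).strip()


# Toy subtitle lines (not data from the corpus).
for raw in ["I thought that...", "...and then he left.", "Wait... what?"]:
    print(repr(strip_dot_runs(raw)))
```

A single sentence-final full stop is left untouched, since only runs of dots mark a continuation across time frames.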
In the filtering step, we adapt the CzEng filtering tool [1] to filter out bad sentence pairs.
The size of cleaned corpora as published is as follows:
              CS/VI               EN/VI
Sentence      1091058/1091058     1113177/1091058
Word          6718184/7646701     8518711/8140876
Unique word   195446/59737        69513/58286
The corpora are used as training data in [2].
References:
[1] Ondřej Bojar, Zdeněk Žabokrtský, et al. 2012. The Joy of Parallelism with CzEng 1.0. Proceedings of LREC2012. ELRA. Istanbul, Turkey.
[2] Duc Tam Hoang and Ondřej Bojar, The Prague Bulletin of Mathematical Linguistics. Volume 104, Issue 1, Pages 75–86, ISSN 1804-0462. 9/2015
This article is based on a literature review of research studies on the family values of Vietnamese immigrants living in various parts of the world. The results provide evidence of a process of adaptation and acculturation among Vietnamese immigrants. At the same time, the studies show attitude change in response to the culture of the host country, including greater gender equality and more relaxed intergenerational relationships. Immigrants seek to retain the Vietnamese culture, which values filial piety, respect for the elderly and high achievement in education, in the host country. The studies reviewed employed acceptable ways to survey and explain the life and the psychological features of Vietnamese living overseas. The research reviewed also focuses on comparing the traditional family values of Vietnamese living in Vietnam with those of Vietnamese living overseas, contributing to the understanding of Vietnamese culture in detail and of immigrant culture in general.
Mai Van Hai. Includes bibliographical references.
Post-WWII geopolitical changes in Indochina and Central & Eastern Europe drastically altered the international relationships of Czechoslovakia. Vietnam became one of its partners. After the 1954 defeat of the French, the first Northern Vietnamese immigrants came to Czechoslovakia. However, after the Velvet Revolution of 1989, political agreements on cultural cooperation ended, and a return migration began. Nevertheless, the reconsolidation of democracy in the successor states of Czechoslovakia did not bring the long-established connection to an end, and spontaneous individual migration started. Since then, thousands of persons have come, and the Czech Republic remains one of the most desirable destinations for Vietnamese migrants. This article is the result of a qualitative survey conducted among pre-1989 returnees that was carried out in Vietnam from July 2010 to February 2011. The main task of the study is to frame the migration in a broader historical and political context, and to show how the consequences and organized features of pre-1989 migration have shaped the perception of Czechoslovakia and the returnees’ relationship with it.
We provide the Vietnamese version of the multilingual test set from the WMT 2013 [1] competition. The Vietnamese version was manually translated from English. For completeness, this record contains the 3000 sentences in all the original WMT 2013 languages (Czech, English, French, German, Russian and Spanish), extended with our Vietnamese version. The test set is used in [2] to evaluate translation between Czech, English and Vietnamese.
References
1. http://www.statmt.org/wmt13/evaluation-task.html
2. Duc Tam Hoang and Ondřej Bojar, The Prague Bulletin of Mathematical Linguistics. Volume 104, Issue 1, Pages 75–86, ISSN 1804-0462. 9/2015