
UMC 0.1: Czech-Russian-English Multilingual Corpus

Please use the following text to cite this item:
Klyueva, Natalia and Bojar, Ondřej, 2008, UMC 0.1: Czech-Russian-English Multilingual Corpus, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11858/00-097C-0000-0001-4909-7.
Date issued
2008-10-02
Size
1,800,000 words
Language(s)
Description
UMC 0.1 is a multilingual parallel corpus of Czech, Russian and English texts with automatic pairwise sentence alignments. The primary aim of UMC is to extend the set of languages covered by the CzEng corpus, mainly for the purposes of machine translation. All texts were downloaded from a single source, Project Syndicate (Copyright: Project Syndicate 1995-2008), which hosts a large collection of high-quality news articles and commentaries. We were given permission to use the texts for research and non-commercial purposes.
Acknowledgement
This item is Publicly Available
and licensed under:
 Files in this item
Name
Czech-Russian-tagged.gz
Size
24.35 MB
Format
application/x-gzip
Description
Tokenized, lemmatized and morphologically tagged data in Czech and Russian (88,093 sentences aligned one-to-one)
MD5
38d599d84181408bdadcc31c2c147140
Preview
  File Preview
    • Czech-Russian-tagged (106 MB)
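
The MD5 value listed for Czech-Russian-tagged.gz can be used to check that a download is intact before the archive is unpacked. The following is a minimal sketch, assuming the file has been saved under its listed name in the current directory. The helper names md5_of_file and first_lines are hypothetical, and the UTF-8 encoding used for the peek at the contents is an assumption that this record does not confirm.

```python
import gzip
import hashlib
from pathlib import Path

# MD5 checksum as listed in the file record above.
EXPECTED_MD5 = "38d599d84181408bdadcc31c2c147140"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_lines(path, count=5):
    """Return up to `count` lines from a gzip-compressed text file.

    UTF-8 is an assumption; undecodable bytes are replaced rather than raising.
    """
    lines = []
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for _, line in zip(range(count), handle):
            lines.append(line.rstrip("\n"))
    return lines


if __name__ == "__main__":
    # Hypothetical local path: the archive saved under its listed name.
    archive = Path("Czech-Russian-tagged.gz")
    if archive.exists():
        actual = md5_of_file(archive)
        if actual == EXPECTED_MD5:
            print("Checksum OK")
            for line in first_lines(archive):
                print(line)
        else:
            print(f"Checksum mismatch: got {actual}")
    else:
        print(f"{archive} not found; download it first")
```

Reading the archive through gzip.open avoids writing the 106 MB uncompressed file to disk when only a quick look at the data is needed.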