This is the synthetic part of CzEng 2.0, a Czech-English parallel corpus.
More details and citing information: https://ufal.mff.cuni.cz/czeng/czeng20

@article{kocmi2020announcing,
  title={Announcing CzEng 2.0 Parallel Corpus with over 2 Gigawords},
  author={Tom Kocmi and Martin Popel and Ondrej Bojar},
  year={2020},
  journal={arXiv preprint arXiv:2007.03006},
}

CzEng is a sentence-parallel Czech-English corpus compiled at the Institute of Formal and Applied Linguistics (ÚFAL). While the full CzEng 2.0 is freely available for non-commercial research purposes from the project website (https://ufal.mff.cuni.cz/czeng), this release contains only the original monolingual parts of news text (csmono with 53M sentences and enmono with 79M sentences) with automatic (synthetic) translations by CUBBITT.

It contains the following files:
csmono - filtered Czech news crawl from 2013-2018 [1], translated to English by CUBBITT [3]
enmono - filtered English news crawl from 2016-2018 [2], translated to Czech by CUBBITT [3]

FILE FORMAT:
Each file contains the following six tab-separated columns:
1. ID - unique ID for each sentence pair; the last segment, starting with "s", distinguishes sentences within the same document
2. adq_score - computed by dual conditional cross-entropy filtering [4]
3. cs_lang_score - p(lang=Czech)/p(lang=x), where p are the probabilities assigned by FastText [5] to a given sentence and x is the most probable language
4. en_lang_score - p(lang=English)/p(lang=x)
5. Czech sentence
6. English sentence
All three scores are between 0 and 1, and higher values mean better scores (cleaner sentence pairs).
Documents are separated by empty lines.
All the data are document-level deduplicated and shuffled.
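
The six-column layout and the empty-line document separator described above can be read with a few lines of Python. This is only a minimal sketch, not an official loader: the names SentencePair, parse_line and read_documents are hypothetical, and the sample rows (IDs, scores, sentences) are invented for illustration.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class SentencePair(NamedTuple):
    """One row of a corpus file (hypothetical helper type)."""
    pair_id: str
    adq_score: float
    cs_lang_score: float
    en_lang_score: float
    cs: str
    en: str


def parse_line(line: str) -> SentencePair:
    """Split one non-empty line into its six tab-separated columns."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}")
    pair_id, adq, cs_score, en_score, cs, en = fields
    return SentencePair(pair_id, float(adq), float(cs_score), float(en_score), cs, en)


def read_documents(stream: TextIO) -> Iterator[List[SentencePair]]:
    """Yield one list of sentence pairs per document; empty lines end a document."""
    document: List[SentencePair] = []
    for line in stream:
        if line.rstrip("\r\n") == "":
            if document:
                yield document
                document = []
        else:
            document.append(parse_line(line))
    if document:
        # The last document may not be followed by an empty line.
        yield document


# Invented sample rows in the six-column layout (not taken from the corpus).
sample = (
    "doc1-s1\t1.0\t0.98\t0.99\tDobrý den.\tGood day.\n"
    "doc1-s2\t1.0\t0.40\t0.95\tOK.\tOK.\n"
    "\n"
    "doc2-s1\t1.0\t0.97\t0.96\tDěkuji.\tThank you.\n"
)
documents = list(read_documents(io.StringIO(sample)))
```

For a real file, the same function can be given any text stream opened with UTF-8 encoding in place of the in-memory sample.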
FILTERING:
After document-level deduplication, we deleted
- sentences longer than 200 (space-separated) words or 1600 characters
- sentences with cs_lang_score<0.5 or en_lang_score<0.5 (only for pairs with more than 10 words in the CS or EN sentence)
- sentences with adq_score<0.02 (the score is not computed for csmono and enmono)
For the synthetic data (csmono and enmono), we set adq_score to 1.0 for all sentences.

NOTES:
If you want a smaller and cleaner corpus, you may consider
- further filtering (sentence level or document level) based on the provided scores.

REFERENCES:
[1] http://data.statmt.org/news-crawl/cs-doc/
[2] http://data.statmt.org/news-crawl/en-doc/
[3] Martin Popel. "CUNI Transformer Neural MT System for WMT18" (2018). https://www.aclweb.org/anthology/W18-6424/
[4] Marcin Junczys-Dowmunt. "Dual conditional cross-entropy filtering of noisy parallel corpora" (2018). https://www.aclweb.org/anthology/W18-6478/
[5] https://fasttext.cc/blog/2017/10/02/blog-post.html
[6] Ondřej Bojar, Ondřej Dušek, Tom Kocmi, Jindřich Libovický, Michal Novák, Martin Popel, Roman Sudarikov, Dušan Variš. "CzEng 1.6: Enlarged Czech-English Parallel Corpus with Processing Tools Dockered." http://link.springer.com/chapter/10.1007/978-3-319-45510-5_27
[7] https://tilde-model.s3-eu-west-1.amazonaws.com/Tilde_MODEL_Corpus.html
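
As an illustration of the further filtering suggested in NOTES, the sketch below applies a stricter language-score threshold at sentence level and at document level. It is one possible approach, not the procedure used to build the corpus: the helper names, the threshold of 0.9, and the sample rows are all assumptions. Because adq_score is fixed to 1.0 in the synthetic data, only the two language scores are used here.

```python
from typing import List, Tuple

# (ID, adq_score, cs_lang_score, en_lang_score, Czech sentence, English sentence)
Row = Tuple[str, float, float, float, str, str]


def keep_pair(row: Row, min_lang_score: float) -> bool:
    """Sentence-level check: both language scores must reach the threshold."""
    _, _, cs_score, en_score, _, _ = row
    return cs_score >= min_lang_score and en_score >= min_lang_score


def keep_document(rows: List[Row], min_lang_score: float) -> bool:
    """Document-level check: keep a document only if every pair passes."""
    return all(keep_pair(row, min_lang_score) for row in rows)


# Invented rows grouped by document (not taken from the corpus).
documents = [
    [
        ("doc1-s1", 1.0, 0.98, 0.99, "Dobrý den.", "Good day."),
        ("doc1-s2", 1.0, 0.60, 0.95, "OK.", "OK."),
    ],
    [
        ("doc2-s1", 1.0, 0.97, 0.96, "Děkuji.", "Thank you."),
    ],
]

MIN_LANG_SCORE = 0.9  # hypothetical threshold, stricter than the 0.5 used above

# Sentence level: drop individual low-scoring pairs, keep the rest of each document.
sentence_filtered = [
    [row for row in doc if keep_pair(row, MIN_LANG_SCORE)] for doc in documents
]

# Document level: drop a whole document if any of its pairs scores too low.
document_filtered = [doc for doc in documents if keep_document(doc, MIN_LANG_SCORE)]
```

Sentence-level filtering keeps more data but leaves gaps inside documents; document-level filtering preserves document continuity at the cost of discarding more text.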