Training, development and test data consist of German sentences from the IT domain, already tokenized. These sentences are the references of the data released for the 2016 edition of the WMT APE shared task. Unlike the previously released data, these sentences were obtained by manually translating the source sentences without leveraging the raw MT outputs. Training and development contain 12,000 and 1,000 segments respectively, while the test set contains 2,000 items. All data is provided by the EU project QT21 (http://www.qt21.eu/).
This item contains models to tune for the WMT16 Tuning shared task for Czech-to-English.
The CzEng 1.6pre corpus (http://ufal.mff.cuni.cz/czeng/czeng16pre) is used to train the translation models. The data is tokenized (using the Moses tokenizer), lowercased, and sentence pairs with a side longer than 60 words or shorter than 4 words are removed before training. Alignment is done using fast_align (https://github.com/clab/fast_align), and the standard Moses pipeline is used for training.
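As an illustration of the length filter described above, here is a minimal Python sketch that keeps a sentence pair only when both sides fall in the 4-60 token range; the file names are placeholders, not the names used in the released package.

```python
# Hypothetical sketch of the pre-training length filter. The data is
# already tokenized, so a whitespace split gives the token count.
def keep(line, lo=4, hi=60):
    n = len(line.split())
    return lo <= n <= hi

# "train.cs"/"train.en" are placeholder file names, not from the package.
with open("train.cs") as f_cs, open("train.en") as f_en, \
     open("train.filtered.cs", "w") as o_cs, \
     open("train.filtered.en", "w") as o_en:
    for cs, en in zip(f_cs, f_en):
        if keep(cs) and keep(en):
            o_cs.write(cs)
            o_en.write(en)
```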
Two 5-gram language models are trained using KenLM: one only on the CzEng English data, the other on all English monolingual data available for WMT except Common Crawl.
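The released language models can be queried with the kenlm Python bindings (https://github.com/kpu/kenlm); below is a minimal sketch, where the model file name is an assumption rather than the actual name in the package.

```python
import kenlm

# "czeng.en.5gram.bin" is a placeholder for whichever model file the
# package provides; kenlm loads both ARPA and binary formats.
model = kenlm.Model("czeng.en.5gram.bin")
print(model.order)  # 5 for these models

# Input should match the training preprocessing: tokenized and lowercased.
# score() returns the log10 probability of the sentence.
print(model.score("this is a test sentence", bos=True, eos=True))
```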
Also included are two lexicalized bidirectional reordering models, one word-based and one hierarchical, with MSD (monotone, swap, discontinuous) orientations conditioned on both the source and target sides of the processed CzEng data.
This item contains models to tune for the WMT16 Tuning shared task for English-to-Czech.
The CzEng 1.6pre corpus (http://ufal.mff.cuni.cz/czeng/czeng16pre) is used to train the translation models. The data is tokenized (using the Moses tokenizer), lowercased, and sentence pairs with a side longer than 60 words or shorter than 4 words are removed before training. Alignment is done using fast_align (https://github.com/clab/fast_align), and the standard Moses pipeline is used for training.
Two 5-gram language models are trained using KenLM: one only on the CzEng Czech data, the other on all Czech monolingual data available for WMT except Common Crawl.
Also included are two lexicalized bidirectional reordering models, one word-based and one hierarchical, with MSD (monotone, swap, discontinuous) orientations conditioned on both the source and target sides of the processed CzEng data.
Training and development data for the WMT 2017 Automatic post-editing task (the same data used for the Sentence-level Quality Estimation task). They consist of German-English triplets (source, target and post-edit) from the pharmacological domain, already tokenized. Training and development contain 25,000 and 1,000 triplets respectively. All data is provided by the EU project QT21 (http://www.qt21.eu/).
Training data for the WMT 2017 Automatic post-editing task (the same data used for the Sentence-level Quality Estimation task). They consist of 11,000 English-German triplets (source, target and post-edit) from the IT domain, already tokenized. All data is provided by the EU project QT21 (http://www.qt21.eu/).
Training and development data for the WMT 2018 Automatic post-editing task. They consist of English-German triplets (source, target and post-edit) from the information technology domain, already tokenized. Training and development contain 13,442 and 1,000 triplets respectively. A neural machine translation system was used to generate the target segments. All data is provided by the EU project QT21 (http://www.qt21.eu/).
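The APE datasets above are typically distributed as three line-aligned plain-text files per split. The following Python sketch shows one way to read them into triplets, assuming the conventional WMT APE suffixes (.src, .mt, .pe); the actual file names in the released packages may differ.

```python
# Hypothetical loader for line-aligned APE triplet files. The "train"
# prefix and the .src/.mt/.pe suffixes are assumptions, not confirmed
# names from the released packages.
def load_triplets(prefix="train"):
    with open(f"{prefix}.src") as src, \
         open(f"{prefix}.mt") as mt, \
         open(f"{prefix}.pe") as pe:
        return [
            {"source": s.strip(), "target": m.strip(), "post_edit": p.strip()}
            for s, m, p in zip(src, mt, pe)
        ]

triplets = load_triplets()
print(len(triplets))  # e.g. 13,442 for the WMT 2018 training set
```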