dc.contributor.author Kamran, Amir
dc.contributor.author Jawaid, Bushra
dc.contributor.author Bojar, Ondřej
dc.contributor.author Stanojevic, Milos
dc.date.accessioned 2016-03-22T12:05:39Z
dc.date.available 2016-03-22T12:05:39Z
dc.date.issued 2016-03-21
dc.identifier.uri http://hdl.handle.net/11372/LRT-1671
dc.description The item contains the baseline models to be tuned in the WMT16 Tuning shared task for Czech-to-English. The CzEng 1.6pre corpus (http://ufal.mff.cuni.cz/czeng/czeng16pre) is used to train the translation models. The data is tokenized (using the Moses tokenizer) and lowercased, and sentences longer than 60 words or shorter than 4 words are removed before training. Alignment is done with fast_align (https://github.com/clab/fast_align) and the standard Moses pipeline is used for training. Two 5-gram language models are trained with KenLM: one using only the English side of CzEng, the other using all English monolingual data available for WMT except Common Crawl. Also included are two lexicalized bidirectional reordering models, word-based and hierarchical, with msd orientations conditioned on both the source and target sides of the processed CzEng.
dc.language.iso ces
dc.language.iso eng
dc.publisher Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.publisher University of Amsterdam, ILLC
dc.relation info:eu-repo/grantAgreement/EC/H2020/645452
dc.rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.source.uri http://www.statmt.org/wmt16/tuning-task/
dc.subject WMT16
dc.subject machine translation
dc.subject tuning
dc.subject baseline models
dc.subject shared task
dc.title WMT16 Tuning Shared Task Models (Czech-to-English)
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
dc.rights.label PUB
has.files yes
branding LRT + Open Submissions
contact.person Amir Kamran amirkamran@msn.com University of Amsterdam, ILLC
sponsor European Union H2020-ICT-2014-1-645452 QT21: Quality Translation 21 euFunds info:eu-repo/grantAgreement/EC/H2020/645452
sponsor Technology Foundation STW 12271 Data-Powered Domain-Specific Translation Services On Demand (DatAptor) Other
files.size 112085128958
files.count 5
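The preprocessing described in dc.description above (lowercasing already-tokenized text and dropping sentences shorter than 4 or longer than 60 words) can be sketched in Python. This is a hypothetical standalone re-implementation for illustration; the function names, and the assumption that the length filter applies to each side of a sentence pair, are not taken from the released Makefile.

```python
# Hypothetical sketch of the corpus preprocessing described above
# (lowercasing plus the 4-60 word length filter); names and the
# per-side application of the filter are assumptions, not part of
# the released Makefile.

def keep_pair(src, tgt, min_len=4, max_len=60):
    """Keep a tokenized sentence pair only if both sides fall
    within the min_len..max_len token window."""
    return all(min_len <= len(side.split()) <= max_len
               for side in (src, tgt))

def preprocess(pairs):
    """Lowercase already-tokenized sentence pairs and drop those
    failing the length filter, as done before model training."""
    for src, tgt in pairs:
        if keep_pair(src, tgt):
            yield src.lower(), tgt.lower()
```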


 Files in this item

Name: cs2en_model.tgz
Size: 36.56 GB
Format: application/x-gzip
Description: Contains the lexical models (lex.e2f and lex.f2e), the phrase table (phrase-table.gz), the word-based reordering model (reordering-table.wbe-msd-bidirectional-fe.gz), the hierarchical reordering model (reordering-table.hier-msd-bidirectional-fe.gz) and the moses.ini file
MD5: 9f97c40bab9bbc8844b362437ead3c71
Contents:
    • reordering-table.wbe-msd-bidirectional-fe.gz (7 GB)
    • moses.ini (1 kB)
    • lex.f2e (668 MB)
    • reordering-table.hier-msd-bidirectional-fe.gz (7 GB)
    • lex.e2f (668 MB)
    • phrase-table.gz (21 GB)
Name: wmt16.czeng.blm.en.tgz
Size: 7.79 GB
Format: application/x-gzip
Description: KenLM 5-gram language model (binarized), trained only on the English side of the CzEng parallel data used for the translation models
MD5: dd910814d89f3bb41261ead0a95930dc
Contents:
    • wmt16.czeng.blm.en (12 GB)
Name: wmt16.mono.blm.en.tgz
Size: 60.04 GB
Format: application/x-gzip
Description: KenLM 5-gram language model (binarized), trained on all English monolingual data available for WMT except Common Crawl (see the Makefile for details of the monolingual data used)
MD5: a57e4fd4f43c05f826cda33cfe257eed
Contents:
    • wmt16.mono.blm.en (98 GB)
Name: Makefile
Size: 16.96 KB
Format: Unknown
Description: The models can be recreated using this Makefile
MD5: 5f56434491ccb9591c35d8fe20fb8aa9
Name: moses.ini
Size: 1.29 KB
Format: Unknown
Description: The moses.ini configuration file for tuning
MD5: 52551ca476c84dbaf9409ca3083b02c2
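For orientation, a moses.ini for a setup like this one wires together the phrase table, a lexicalized reordering model, and a language model as Moses feature functions. The fragment below is a hypothetical sketch in the standard Moses configuration format, not the contents of the distributed file; all paths, feature options, and weights are placeholders.

```ini
# Hypothetical moses.ini sketch (not the distributed file);
# paths and weights are placeholders.
[input-factors]
0

[mapping]
0 T 0

[feature]
UnknownWordPenalty
WordPenalty
Distortion
PhraseDictionaryMemory name=TranslationModel0 num-features=4 path=phrase-table.gz input-factor=0 output-factor=0
LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=reordering-table.wbe-msd-bidirectional-fe.gz
KENLM name=LM0 factor=0 path=wmt16.czeng.blm.en order=5

[weight]
UnknownWordPenalty0= 1
WordPenalty0= -1
Distortion0= 0.3
TranslationModel0= 0.2 0.2 0.2 0.2
LexicalReordering0= 0.3 0.3 0.3 0.3 0.3 0.3
LM0= 0.5
```

The [weight] values shown here are only initial values; the point of the tuning task is to replace them with optimized ones.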
