Show simple item record

 
dc.contributor.author Kamran, Amir
dc.contributor.author Jawaid, Bushra
dc.contributor.author Bojar, Ondřej
dc.contributor.author Stanojevic, Milos
dc.date.accessioned 2016-03-22T12:33:39Z
dc.date.available 2016-03-22T12:33:39Z
dc.date.issued 2016-03-21
dc.identifier.uri http://hdl.handle.net/11372/LRT-1672
dc.description This item contains models to tune for the WMT16 Tuning shared task for English-to-Czech. The CzEng 1.6pre corpus (http://ufal.mff.cuni.cz/czeng/czeng16pre) is used to train the translation models. The data is tokenized (using the Moses tokenizer) and lowercased, and sentences longer than 60 words or shorter than 4 words are removed before training. Alignment is computed with fast_align (https://github.com/clab/fast_align), and the standard Moses pipeline is used for training. Two 5-gram language models are trained with KenLM: one using only the CzEng Czech data, the other using all Czech monolingual data available for WMT except Common Crawl. Also included are two lexicalized bidirectional reordering models, word-based and hierarchical, with msd orientation conditioned on both source and target of the processed CzEng data.
dc.language.iso eng
dc.language.iso ces
dc.publisher Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.publisher University of Amsterdam, ILLC
dc.relation info:eu-repo/grantAgreement/EC/H2020/645452
dc.rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.source.uri http://www.statmt.org/wmt16/tuning-task/
dc.subject WMT16
dc.subject machine translation
dc.subject tuning
dc.subject baseline models
dc.subject shared task
dc.title WMT16 Tuning Shared Task Models (English-to-Czech)
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
dc.rights.label PUB
has.files yes
branding LRT + Open Submissions
contact.person Amir Kamran amirkamran@msn.com University of Amsterdam, ILLC
sponsor European Union H2020-ICT-2014-1-645452 QT21: Quality Translation 21 euFunds info:eu-repo/grantAgreement/EC/H2020/645452
sponsor Technology Foundation STW 12271 Data-Powered Domain-Specific Translation Services On Demand (DatAptor) Other
files.size 70667353277
files.count 5
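The sentence-length filter mentioned in the description (drop pairs with a side longer than 60 or shorter than 4 words) can be sketched as follows. This is a minimal illustration using the thresholds stated above; the function name and example sentences are not part of the released pipeline:

```python
def keep_pair(src: str, tgt: str, min_len: int = 4, max_len: int = 60) -> bool:
    """Keep a sentence pair only if both sides have between min_len and max_len tokens."""
    for sent in (src, tgt):
        n = len(sent.split())
        if n < min_len or n > max_len:
            return False
    return True


# Example: the second pair is dropped because both sides have fewer than 4 tokens.
pairs = [
    ("this is a short sentence .", "toto je kratka veta ."),
    ("too short", "prilis kratke"),
]
filtered = [p for p in pairs if keep_pair(*p)]  # keeps only the first pair
```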


 Files in this item

Name: en2cs_model.tgz
Size: 37.56 GB
Format: application/x-gzip
Description: Contains the lexical models (lex.e2f and lex.f2e), the phrase table (phrase-table.gz), the word-based reordering model (reordering-table.wbe-msd-bidirectional-fe.gz), the hierarchical reordering model (reordering-table.hier-msd-bidirectional-fe.gz), and the moses.ini file
MD5: bed88bbeef3afc454c3f02845ab72769
File preview:
    • reordering-table.wbe-msd-bidirectional-fe.gz (7 GB)
    • moses.ini (1 kB)
    • lex.f2e (668 MB)
    • reordering-table.hier-msd-bidirectional-fe.gz (7 GB)
    • lex.e2f (668 MB)
    • phrase-table.gz (21 GB)

Name: wmt16.czeng.blm.cs.tgz
Size: 9.28 GB
Format: application/x-gzip
Description: KenLM 5-gram language model (binarized), trained only on the Czech side of the CzEng parallel data used
MD5: de338fd4ba04b82631aab9488c468cd6
File preview:
    • wmt16.czeng.blm.cs (15 GB)

Name: wmt16.mono.blm.cs.tgz
Size: 18.98 GB
Format: application/x-gzip
Description: KenLM 5-gram language model (binarized), trained on all Czech monolingual data available for WMT except Common Crawl (see the Makefile for details of the monolingual data used)
MD5: 6347aa8e420db60142cb1384ea1cab0d
File preview:
    • wmt16.mono.blm.cs (32 GB)

Name: Makefile
Size: 16.96 KB
Format: Unknown
Description: The models can be recreated using this Makefile
MD5: 5f56434491ccb9591c35d8fe20fb8aa9

Name: moses.ini
Size: 1.29 KB
Format: Unknown
Description: The moses.ini file for tuning
MD5: 8c60e67f303419ad03fee1fe00aef8cf
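Downloaded archives can be checked against the MD5 checksums listed above. A minimal sketch (the helper name is ours; the commented-out file path and expected hash refer to the first archive in this record):

```python
import hashlib


def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading in chunks to bound memory use."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


# Example: verify en2cs_model.tgz against the checksum from this record.
# expected = "bed88bbeef3afc454c3f02845ab72769"
# assert md5_of("en2cs_model.tgz") == expected
```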
