dc.contributor.author Kamran, Amir
dc.contributor.author Jawaid, Bushra
dc.contributor.author Bojar, Ondřej
dc.contributor.author Stanojevic, Milos
dc.date.accessioned 2016-03-22T12:05:39Z
dc.date.available 2016-03-22T12:05:39Z
dc.date.issued 2016-03-21
dc.identifier.uri http://hdl.handle.net/11372/LRT-1671
dc.description The item contains the models to be tuned for the WMT16 Tuning shared task, Czech-to-English. The CzEng 1.6pre corpus (http://ufal.mff.cuni.cz/czeng/czeng16pre) is used to train the translation models. The data are tokenized (using the Moses tokenizer) and lowercased, and sentence pairs longer than 60 words or shorter than 4 words are removed before training. Word alignment is done with fast_align (https://github.com/clab/fast_align), and the standard Moses pipeline is used for training. Two 5-gram language models are trained with KenLM: one on the English side of CzEng only, the other on all English monolingual data available for WMT except Common Crawl. Also included are two lexicalized bidirectional reordering models, word-based and hierarchical, with MSD orientations conditioned on both source and target, trained on the processed CzEng data.
dc.language.iso ces
dc.language.iso eng
dc.publisher Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.publisher University of Amsterdam, ILLC
dc.relation info:eu-repo/grantAgreement/EC/H2020/645452
dc.rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.source.uri http://www.statmt.org/wmt16/tuning-task/
dc.subject WMT16
dc.subject machine translation
dc.subject tuning
dc.subject baseline models
dc.subject shared task
dc.title WMT16 Tuning Shared Task Models (Czech-to-English)
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
dc.rights.label PUB
has.files yes
branding LRT + Open Submissions
contact.person Amir Kamran amirkamran@msn.com University of Amsterdam, ILLC
sponsor European Union H2020-ICT-2014-1-645452 QT21: Quality Translation 21 euFunds info:eu-repo/grantAgreement/EC/H2020/645452
sponsor Technology Foundation STW 12271 Data-Powered Domain-Specific Translation Services On Demand (DatAptor) Other
files.size 112085128958
files.count 5
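The length filter described above (dropping sentence pairs shorter than 4 or longer than 60 words after tokenization and lowercasing) can be sketched as follows. This is an illustrative reimplementation, not the actual script used; in the standard Moses pipeline this step is performed by clean-corpus-n.perl.

```python
def clean_corpus(pairs, min_len=4, max_len=60):
    """Keep only sentence pairs where both sides have between
    min_len and max_len tokens (inclusive), after lowercasing,
    mirroring the 4-60 word filter described in this record."""
    kept = []
    for src, tgt in pairs:
        src_tok = src.lower().split()
        tgt_tok = tgt.lower().split()
        if (min_len <= len(src_tok) <= max_len
                and min_len <= len(tgt_tok) <= max_len):
            kept.append((" ".join(src_tok), " ".join(tgt_tok)))
    return kept
```

Note that the filter is applied to both sides of a pair: if either the Czech or the English sentence falls outside the 4-60 token range, the whole pair is discarded.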


Files in this item

Name: cs2en_model.tgz
Size: 36.56 GB
Format: application/x-gzip
Description: Contains the lexical models (lex.e2f and lex.f2e), the phrase table (phrase-table.gz), the word-based reordering model (reordering-table.wbe-msd-bidirectional-fe.gz), the hierarchical reordering model (reordering-table.hier-msd-bidirectional-fe.gz), and the moses.ini file.
MD5: 9f97c40bab9bbc8844b362437ead3c71
Contents:
    • reordering-table.wbe-msd-bidirectional-fe.gz (7 GB)
    • moses.ini (1 kB)
    • lex.f2e (668 MB)
    • reordering-table.hier-msd-bidirectional-fe.gz (7 GB)
    • lex.e2f (668 MB)
    • phrase-table.gz (21 GB)
Name: wmt16.czeng.blm.en.tgz
Size: 7.79 GB
Format: application/x-gzip
Description: KenLM 5-gram language model (binarized) trained only on the English side of the CzEng parallel data used.
MD5: dd910814d89f3bb41261ead0a95930dc
Contents:
    • wmt16.czeng.blm.en (12 GB)
Name: wmt16.mono.blm.en.tgz
Size: 60.04 GB
Format: application/x-gzip
Description: KenLM 5-gram language model (binarized) trained on all English monolingual data available for WMT except Common Crawl (see the Makefile for details of the monolingual data used).
MD5: a57e4fd4f43c05f826cda33cfe257eed
Contents:
    • wmt16.mono.blm.en (98 GB)
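The two archives above hold binarized KenLM 5-gram models. As a toy illustration of what an n-gram language model computes (a maximum-likelihood bigram model here for brevity; KenLM itself uses modified Kneser-Ney smoothing and a binary trie format, not this code):

```python
from collections import Counter

def train_bigram_mle(sentences):
    """Maximum-likelihood bigram model: P(w2 | w1) =
    count(w1 w2) / count(w1), with <s> marking sentence start."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split()
        for w1, w2 in zip(tokens, tokens[1:]):
            unigrams[w1] += 1
            bigrams[(w1, w2)] += 1
    # Return a conditional-probability function; unseen contexts score 0.0.
    return lambda w1, w2: bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

prob = train_bigram_mle(["the cat sat", "the cat ran", "a dog sat"])
# P(cat | the) = 2/2 = 1.0 ; P(sat | cat) = 1/2 = 0.5
```

A real 5-gram model conditions each word on the four preceding words instead of one, and smooths the counts so unseen n-grams receive non-zero probability.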
Name: Makefile
Size: 16.96 KB
Format: unknown
Description: The models can be recreated using this Makefile.
MD5: 5f56434491ccb9591c35d8fe20fb8aa9
Name: moses.ini
Size: 1.29 KB
Format: unknown
Description: The moses.ini file for tuning.
MD5: 52551ca476c84dbaf9409ca3083b02c2
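Each file above lists an MD5 checksum. After downloading, integrity can be verified with a short script like the following; the chunked read keeps memory use constant even for the multi-gigabyte archives.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in
    1 MiB chunks so large archives need not fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Example: verify the downloaded Makefile against the checksum listed above.
# assert md5_of_file("Makefile") == "5f56434491ccb9591c35d8fe20fb8aa9"
```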
