Show simple item record

 
dc.contributor.author Kloudová, Věra
dc.contributor.author Mraček, David
dc.contributor.author Bojar, Ondřej
dc.contributor.author Popel, Martin
dc.date.accessioned 2023-04-28T08:42:34Z
dc.date.available 2023-04-28T08:42:34Z
dc.date.issued 2023-04-26
dc.identifier.uri http://hdl.handle.net/11234/1-5141
dc.description We define "optimal reference translation" as a translation thought to be the best possible that can be achieved by a team of human translators. Optimal reference translations can be used in assessments of excellent machine translations. We selected 50 documents (online news articles, with 579 paragraphs in total) from the 130 English documents included in the WMT2020 news test (http://www.statmt.org/wmt20/), aiming to preserve the diversity (style, genre, etc.) of the selection. In addition to the official Czech reference translation provided by the WMT organizers (P1), we hired two additional translators (P2 and P3, native Czech speakers) via a professional translation agency, resulting in three independent translations. The main contribution of this dataset is two additional translations (i.e. optimal reference translations N1 and N2), done jointly by two translators-cum-theoreticians with extreme care for various aspects of translation quality, while taking into account the translations P1-P3. We also publish internal comments (in Czech) for some of the segments. Translation N1 should be closer to the English original (with regard to meaning and linguistic structure), and female surnames use the Czech feminine suffix (e.g. "Mai" is translated as "Maiová"). Translation N2 is freer, trying to be more creative, idiomatic and entertaining for readers and following the typical style used in Czech media, while still preserving the rules of functional equivalence. Translation N2 is missing for the segments where it was not deemed necessary to provide two alternative translations. For applications and analyses that need a translation of all segments, a missing N2 should be interpreted as identical to N1 for the given segment. We provide the dataset in two formats: OpenDocument spreadsheet (ods) and plain text (one file for each translation and one for the English original).
Some words were highlighted in different colors during the creation of the optimal reference translations; this highlighting and the comments are present only in the ods format (some comments refer to row numbers in the ods file). In the plain-text files, documents are separated by empty lines and each document starts with a special line containing the document name (e.g. "# upi.205735"), which allows alignment with the original WMT2020 news test. For the segments where N2 translations are missing in the ods format, the respective N1 segments are used instead in the plain-text format.
dc.language.iso ces
dc.language.iso eng
dc.publisher Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.rights Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.subject translational equivalence
dc.subject reference translation
dc.subject optimal reference translation
dc.subject WMT
dc.title Optimal reference translation of English-Czech WMT2020
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
dc.rights.label PUB
has.files yes
branding LINDAT / CLARIAH-CZ
contact.person Martin Popel popel@ufal.mff.cuni.cz Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
sponsor Czech Science Foundation 19-26934X Neural Representations in Multi-modal and Multi-lingual Modelling nationalFunds
sponsor Grantová agentura České republiky GX20-16819X LUSyD – Language Understanding: from Syntax to Discourse nationalFunds
size.info 50 articles
size.info 579 texts
files.size 577663
files.count 1
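
The plain-text layout described in dc.description (documents separated by empty lines, each starting with a "# name" line) can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the function names `parse_documents` and `pair_segments` are hypothetical, the sample text is made up for illustration, and the code assumes that every non-empty line after the name line is one segment and that the files are line-aligned with each other.

```python
from typing import Dict, List, Tuple


def parse_documents(text: str) -> Dict[str, List[str]]:
    """Split the content of one plain-text file into {document name: [segments]}.

    Assumption: each non-empty line after the "# name" line is one segment.
    """
    docs: Dict[str, List[str]] = {}
    current = None  # name of the document currently being read
    for line in text.splitlines():
        if not line.strip():
            current = None  # an empty line ends the current document
        elif current is None:
            if not line.startswith("# "):
                raise ValueError(f"expected a '# name' line, got: {line!r}")
            current = line[2:].strip()
            docs[current] = []
        else:
            docs[current].append(line)
    return docs


def pair_segments(
    source: Dict[str, List[str]], target: Dict[str, List[str]]
) -> List[Tuple[str, str, str]]:
    """Pair source and target segments by document name and position.

    Assumption: both files contain the same documents with the same
    number of segments per document.
    """
    pairs: List[Tuple[str, str, str]] = []
    for name, src_segments in source.items():
        tgt_segments = target[name]
        if len(src_segments) != len(tgt_segments):
            raise ValueError(f"segment count mismatch in document {name}")
        for src, tgt in zip(src_segments, tgt_segments):
            pairs.append((name, src, tgt))
    return pairs


# Made-up sample in the described layout (not taken from the dataset).
SAMPLE_EN = "# upi.205735\nFirst segment.\nSecond segment.\n\n# example.1\nOnly segment.\n"
SAMPLE_CS = "# upi.205735\nPrvní segment.\nDruhý segment.\n\n# example.1\nJediný segment.\n"

english = parse_documents(SAMPLE_EN)
czech = parse_documents(SAMPLE_CS)
aligned = pair_segments(english, czech)
```

With the real data, the same functions would be applied to the contents of source-english.txt and one of the translation files, read as UTF-8 text.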


Files in this item

Name optimal-ref-translation-en-cs-wmt20.zip
Size 564.12 KB
Format application/zip
MD5 262b57951800300803821c6c73e0b1f6

File preview
    • translation-P2.txt (113 kB)
    • translation-N2.txt (124 kB)
    • source-english.txt (104 kB)
    • translation-P1.txt (118 kB)
    • translation-N1.txt (122 kB)
    • optimal-ref-translation-en-cs-wmt20.ods (275 kB)
    • translation-P3.txt (112 kB)
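
The MD5 checksum listed for the zip archive can be used to check a downloaded copy. The sketch below uses Python's standard `hashlib` module; the helper names (`md5_of_chunks`, `read_chunks`, `matches_record`) are hypothetical, and the local path passed to `matches_record` is whatever location the archive was saved to.

```python
import hashlib
from typing import Iterable, Iterator

# Checksum as listed in this record for optimal-ref-translation-en-cs-wmt20.zip.
EXPECTED_MD5 = "262b57951800300803821c6c73e0b1f6"


def md5_of_chunks(chunks: Iterable[bytes]) -> str:
    """Return the hexadecimal MD5 digest of a sequence of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def read_chunks(path: str, chunk_size: int = 1 << 20) -> Iterator[bytes]:
    """Yield the file at `path` in chunks so it is never loaded whole."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield chunk


def matches_record(path: str) -> bool:
    """Check a downloaded archive against the checksum listed above."""
    return md5_of_chunks(read_chunks(path)) == EXPECTED_MD5
```

For example, `matches_record("optimal-ref-translation-en-cs-wmt20.zip")` would return `True` only if the local file has the listed checksum. MD5 is suitable here for detecting a corrupted download, not as a security guarantee.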
