Khresmoi Summary Translation Test Data for the Medical Domain
version 1.1

Apr 28, 2014

Pavel Pecina


1. Description

This package contains data sets for development (Section dev) and
testing (Section test) of machine translation of sentences from
summaries of medical articles between Czech, English, French, and
German. Version 1.1 of this data set differs from version 1.0 in
punctuation, which was normalized using the attached script
normalize-punctuation.pl.


2. Preamble

2.1 Source

The original sentences are sampled from summaries of English medical
documents crawled from the web in 2012 and identified to be relevant
to 50 medical topics. The translations were carried out by the
Charles University in Prague.

2.2 License

The Khresmoi Summary Test Set is made available under the terms of the
Creative Commons Attribution-Noncommercial (CC-BY-NC) license, version
3.0 unported. You may use it for academic research and all
non-commercial purposes as long as the authors (cf. Section 4) are
properly credited and sources acknowledged (cf. Sections 6 and 7). See
http://creativecommons.org/licenses/by-nc/3.0/ for a full description
and explanation of the licensing terms.


4. Authors

Ondrej Dušek, Jan Hajič, Jaroslava Hlaváčová, Pavel Pecina,
Aleš Tamchyna, Zdeňka Urešová

Charles University in Prague
Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics
Malostranské nám. 25
118 00 Prague 1
Czech Republic


3. Data

3.1 Description

The original sentences in English were randomly selected from
automatically generated summaries of documents from the CLEF 2013
eHealth Task 3 collection [1] which were found to be relevant to 50
test topics provided for the same task. Out-of-domain and
ungrammatical sentences were manually removed. The sentences are
provided with information on document ID and topic ID. The topic
descriptions are provided as well. The sentences were translated by
medical experts into Czech, French, and German and reviewed.
The data sets can be used, for example, for the development and
testing of machine translation in the medical domain.

3.2 Data format

Translation data format follows these rules:

* The data is provided in two formats: plain text and SGML. The files
  are split according to the section (dev/test) and language
  (CS - Czech, DE - German, FR - French, EN - English).
* All the files use the UTF-8 encoding.
* The plain text files contain one sentence per line and translations
  are identified by line numbers.
* The SGML format suits the NIST MT scoring tool.

The topic description format is based on XML. Each topic description
contains tags for the following:

- topic ID,
- reference to discharge summary,
- text of the query,
- <desc> - longer description of what the query means,
- <narr> - expected content of the relevant documents,
- <profile> - profile of the user.

3.3 Statistics (number of sentences and words for each language)

--------------------------------------------------------
section  sentences    Czech   German   French  English
--------------------------------------------------------
dev            500    9,209    9,924   12,369   10,350
test         1,000   19,191   20,831   26,183   21,423
--------------------------------------------------------

3.4 Files

Section 1 (development data)

khresmoi-summary-dev.en        - original dev set sentences in English
khresmoi-summary-dev.cs        - Czech dev set translations
khresmoi-summary-dev.de        - German dev set translations
khresmoi-summary-dev.fr        - French dev set translations
khresmoi-summary-dev.en.sgm    - original dev set in English (SGML format)
khresmoi-summary-dev.cs.sgm    - Czech dev set translations (SGML format)
khresmoi-summary-dev.de.sgm    - German dev set translations (SGML format)
khresmoi-summary-dev.fr.sgm    - French dev set translations (SGML format)
khresmoi-summary-dev.doc_id    - document IDs of the dev set translations
khresmoi-summary-dev.topic_id  - topic IDs of the dev set translations

Section 2 (test data)

khresmoi-summary-test.en       - original test set sentences in English
khresmoi-summary-test.cs       - Czech test set translations
khresmoi-summary-test.de       - German test set translations
khresmoi-summary-test.fr       - French test set translations
khresmoi-summary-test.en.sgm   - original test set in English (SGML format)
khresmoi-summary-test.cs.sgm   - Czech test set translations (SGML format)
khresmoi-summary-test.de.sgm   - German test set translations (SGML format)
khresmoi-summary-test.fr.sgm   - French test set translations (SGML format)
khresmoi-summary-test.doc_id   - document IDs of the test set translations
khresmoi-summary-test.topic_id - topic IDs of the test set translations

normalize-punctuation.pl              - a script to normalize punctuation
README.TXT                            - this file
queries.clef2013ehealth.1-50.test.xml - topic descriptions


5. Acknowledgments

This work was supported by the EU FP7 project Khresmoi (European
Commission contract No. 257528). The language resources are
distributed by the LINDAT/Clarin project of the Ministry of Education,
Youth and Sports of the Czech Republic (project no. LM2010013).

We thank all the data providers and copyright holders (see Sections 6
and 7) for providing the source data, and the anonymous medical
experts for translating and revising the data.


6. Source web pages

bestbets.org, canceraustralia.nbocc.org.au, cks.nhs.uk,
clinicaltrial.gov, clinicianonnet.blogspot.com,
clubderevistas.blogspot.com, copd.about.com,
emergencymedic.blogspot.com, en.diagnosispro.com, ghr.nlm.nih.gov,
marshfieldclinic.kramesonline.com,
peripheralneuropathycenter.uchicago.edu, publications.nice.org.uk,
webeye.ophth.uiowa.edu, web.mit.edu, wiki.medpedia.com,
www.acidreflux-heartburn-gerd.net, www.ahaf.org, www.ahrq.gov,
www.allinahealth.org, www.babycenter.ca, www.back.com,
www.bccancer.bc.ca, www.cadth.ca, www.cancerquest.org,
www.digestive.niddk.nih.gov, www.elginhealth.on.ca,
www.emedicinezone.com, www.endoatlas.com, www.fda.gov,
www.ghr.nlm.nih.gov, www.guideline.gov, www.heart-vessels.com,
www.hipsforyou.com, www.hon.ch, www.innovations.ahrq.gov,
www.irontherapy.org, www.jamesshuggins.com, www.marshfieldclinic.org,
www.medhelp.org, www.mediscuss.org, www.mgh.org, www.ncbi.nlm.nih.gov,
www.netwellness.org, www.nevasic.com, www.nlm.nih.gov,
www.oncolink.com, www.orpha.net, www.pathguy.com,
www.pathologyoutlines.com, www.pbfluids.com,
www.pediatriceducation.org, www.qualitymeasures.ahrq.gov,
www.randyamy.com, www.shoulderdoc.co.uk, www.skincaredaily.com,
www.thevest.com, www.totalhealth.co.uk, www.tripanswers.org,
www.uptodate.com, www.vitalhealthzone.com, www.voicedoctor.net,
www.waent.org, www.webmd.boots.com, www.xrayrisk.com,
www.yorkyates.com, www.yourhealth.net.au


7. Copyright holders

* Adapted from a resource produced by Elgin St. Thomas Public Health.
  Distributed by LINDAT/Clarin, Czech Republic.
* Agency for Healthcare Research and Quality (AHRQ)
* All contents copyright © 2003-2013 Donna M. D'Alessandro, M.D. and
  Michael P. D'Alessandro, M.D. All rights reserved.
* Copyright © BabyCenter, L.L.C. 2013. All rights reserved.
* Copyright Shoulderdoc Ltd.
* Copyright Trip database Limited.
* Copyright © 1996-2012 Atlanta South Gastroenterology, P.C. All
  rights reserved.
* Copyright © 2000-2013 BrightFocus Foundation. All rights reserved.
* Copyright © 2003-2013 Donna M. D'Alessandro, M.D. and Michael P.
  D'Alessandro, M.D. All rights reserved.
* Copyright © 2009 Boots UK Limited and WebMD UK Limited.
* Copyright © 2010 AngioCalc, LLC. All rights reserved.
* Copyright © 2012 Allina Health System. All rights reserved.
* Copyright © 2012 Marshfield Clinic. All Rights Reserved.
* Copyright © 2013 BC Cancer Agency. All rights reserved.
* Copyright © 2013 Medtronic. All Rights Reserved.
* Copyright © 2013 UpToDate, Inc. All rights reserved.
* Heart Vessels
* Intellectual Property Office © Crown Copyright 2013
* Iron Therapy
* Marquette General Health System.
* MedHelp
* National Library of Medicine (NLM)
* Orphanet: an online rare disease and orphan drug data base.
  Copyright, INSERM 1997. Available on http://www.orpha.net.
  Accessed July 2013.
* © Pan American Health Organization. All rights reserved.
* © Thyroid Cancer Canada/Cancer de la thyroïde Canada
* Under copyright of AcidReflux-Heartburn-Gerd.net
* U.S. Food and Drug Administration
* © 2000-2013 Hill-Rom Services, Inc. All Rights Reserved.
* © 2013 Canadian Agency for Drugs and Technologies in Health


8. References

[1] L. Goeuriot, G. J. F. Jones, L. Kelly, J. Leveling, A. Hanbury,
    H. Müller, et al., ShARe/CLEF eHealth evaluation lab 2013, task 3:
    Information retrieval to address patients’ questions when reading
    clinical reports, in: D. T. Pamela Forner, Roberto Navigli (Ed.),
    CLEF 2013 Evaluation Labs and Workshop, Online Working Notes,
    Valencia, Spain, 2013.
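

Note on reading the plain text files

Section 3.2 states that the plain text files hold one sentence per
line and that translations are identified by line numbers. The
following Python sketch shows one way to pair two such files under
that assumption. The function name read_parallel is hypothetical and
is not part of this package; the sketch also assumes that both files
are UTF-8 encoded and have the same number of lines.

```python
def read_lines(path):
    """Read a UTF-8 text file and return its lines without newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def read_parallel(src_path, tgt_path):
    """Pair sentences from two plain text files by line number.

    read_parallel is a hypothetical helper for illustration only.
    It assumes line i of src_path corresponds to line i of tgt_path.
    """
    src = read_lines(src_path)
    tgt = read_lines(tgt_path)
    if len(src) != len(tgt):
        raise ValueError(
            f"line count mismatch: {len(src)} vs {len(tgt)}"
        )
    return list(zip(src, tgt))


# Possible usage with files from this package (paths are assumed to be
# relative to the directory holding the data):
#   pairs = read_parallel("khresmoi-summary-dev.en",
#                         "khresmoi-summary-dev.cs")
```

The same pairing by line number should apply to the .doc_id and
.topic_id files, since they are described as listing the IDs of the
translations in the corresponding section.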