Extended CLEF eHealth Test Collection for Cross-lingual Information Retrieval
in the Medical Domain

version 1.0 (April 2019)

1. Description

This package contains an extended version of the test collection used in the
CLEF eHealth Information Retrieval tasks in 2013--2015. Compared to the
original version, it provides complete query translations into Czech, French,
German, Hungarian, Polish, Spanish and Swedish, and additional relevance
assessments. This dataset is described in [5] and available from the
LINDAT/CLARIN repository: http://hdl.handle.net/11234/1-2925.

2. Preamble

2.1 Source

The data is adopted from the CLEF eHealth Information Retrieval tasks
2013--2015 (https://sites.google.com/site/clefehealth/) organized under the
CLEF initiative (http://clef-initiative.eu).

2.2 License

The original data (document collection, queries, relevance assessments) is
available under its original license (see
http://catalog.elra.info/product_info.php?products_id=1218 and
https://github.com/CLEFeHealth).

The newly added data is made available under the terms of the Creative
Commons Attribution-NonCommercial (CC-BY-NC) license, version 4.0
International. You may use it for academic research and all non-commercial
purposes as long as the authors (cf. Authors, below) are properly credited
and the sources acknowledged (cf. the Acknowledgments section). See
http://creativecommons.org/licenses/by-nc/4.0/ for a full description and
explanation of the licensing terms.

3. Data

This package contains the original English document collection (also
available from http://catalog.elra.info/product_info.php?products_id=1218),
the original queries in English and the relevance assessments (also available
from https://github.com/CLEFeHealth), human translations of the queries into
7 languages, and machine translations of those queries from the 7 languages
back into English.

3.1 Document collection

See directory ./collection/.

3.1.1 Description

The document collection is identical to the one used in the CLEF eHealth 2015
IR task. It includes 1,104,298 web pages in HTML that were crawled from
English medical and health-related websites [1].

NOTE: The CLEF eHealth 2015 IR document collection is a subset of the 2013
collection; some documents were removed in 2015. The file
./original/CLEFeHealth2013Task3/format-script-clefeHealth-task3/ehealth_task3/docids.ehealth
contains the IDs of the 2013 documents, while ./collection/document_ids.txt
contains the IDs of the 2015 documents.

3.1.2 Data format

The files in the collection have the .dat extension. Each file contains a set
of web pages, and each web page is stored as a record with the following
structure:

  #UID:attra0843_12_000001
  #DATE:201209
  #URL:http://www.attract.wales.nhs.uk/answer.aspx?criteria=&qid=1005&src=0
  #CONTENT:
  ...
  #EOR

#UID is a unique ID of the web page, #DATE is the date (in YYYYMM format)
when the page was fetched from the URL given in the #URL field, and the raw
HTML content of the page goes between #CONTENT and #EOR. Note that some pages
are binary files (e.g. PDF, DOCX, PPT).

3.1.3 Files

The document collection is available in ./collection/.

All valid document IDs are listed in the file ./collection/document_ids.txt.
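The following is a minimal sketch (not part of the distribution) of how the
records in a .dat file could be split according to the format described in
3.1.2. It reads the file as bytes, since some documents are binary (e.g. PDF):

  def read_records(path):
      with open(path, "rb") as f:
          data = f.read()
      # Records are terminated by #EOR; this naive split could break if the
      # byte sequence "#EOR" ever occurs inside binary content.
      for chunk in data.split(b"#EOR"):
          chunk = chunk.strip()
          if not chunk:
              continue
          header, _, content = chunk.partition(b"#CONTENT:")
          record = {"content": content.lstrip(b"\r\n")}
          for line in header.splitlines():
              for key in (b"#UID:", b"#DATE:", b"#URL:"):
                  if line.startswith(key):
                      field = key[1:-1].decode().lower()  # uid / date / url
                      record[field] = line[len(key):].strip().decode(
                          "utf-8", "replace")
          yield record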
3.2 Queries

See directory ./queries/.

3.2.1 Description

The queries in this collection are based on the official queries from the
CLEF eHealth Information Retrieval tasks 2013 [1], 2014 [2] and 2015 [3].

The queries in the 2013 and 2014 ShARe/CLEF eHealth IR tasks include medical
terms due to the way they were extracted from discharge summaries. In the
2015 CLEF eHealth IR task (Task 2: User-Centred Health Information
Retrieval), the queries simulate the way lay people express their medical
information needs, i.e. by describing their symptoms without any knowledge
of the underlying disease.

The test queries from 2013, 2014 and 2015 were merged and split randomly
into two sets: 100 queries intended for training and 66 queries for testing.
The split preserves the distribution of the year of origin, the ratio of
relevant/irrelevant documents, and the length of the queries. This split is
referred to as CUNI2017.

3.2.2 Data format

The queries in this package are provided in XML format (UTF-8 encoded) and
contain the original content plus translations into Czech (cs), German (de),
English (en), Spanish (es), French (fr), Hungarian (hu), Polish (pl) and
Swedish (sv), where applicable. See details in [3].

3.2.3 Files

The queries are available in ./queries/ in several forms:

a) Queries as released by the organizers of the CLEF eHealth IR tasks, named
   topics.CLEF201x-eHealth.[train,test].[LANG ID].xml. We modified the tags
   in these queries so that all files use the same tag names (in the original
   data, the topics.CLEF2015-eHealth.* files used different tag names than
   the 2013 and 2014 files).

b) Queries translated by us (by medical experts whom we asked to do so):
   - topics.CLEF2013-eHealth.[test].[cs,de,fr,es,hu,pl,sv]
   - topics.CLEF2014-eHealth.[test].[es,hu,pl,sv]
   - topics.CLEF2015-eHealth.[test].[es,hu,pl,sv]

c) Queries according to our split:
   topics.CUNI2017-eHealth.[train,test].[all languages].xml
   These queries contain only the titles and their translations into the
   other languages.

3.2.4 Statistics

The following table shows statistics of the query test sets: the average
query length (in number of tokens) for each language in each set, where 2013
refers to the test set of 50 queries from ShARe/CLEF eHealth 2013 (Task 3),
2014 to the test set of 50 queries from ShARe/CLEF eHealth 2014 (Task 3),
and 2015 to the test set of 66 queries from CLEF eHealth 2015 (Task 2).

-----------------------------------------------------------------
Lang/Set   2013   2014   2015   CUNI2017-Train   CUNI2017-Test
-----------------------------------------------------------------
EN         5.02   5.18   5.48        4.32             4.15
CS         5.02   5.32   5.44        5.28             5.27
DE         4.26   4.72   5.29        4.85             4.74
ES         5.18   5.86   6.41        5.90             5.83
FR         5.28   5.98   6.45        6.07             5.78
HU         4.54   4.76   5.14        4.88             4.78
PL         5.38   5.54   5.70        5.53             5.59
SV         4.38   4.42   5.29        4.82             4.65
-----------------------------------------------------------------
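A minimal sketch of computing such an average title length from one of the
query files is shown below. The tag names "query" and "title" are
assumptions made for illustration and should be checked against the actual
files (all of which share one unified tag set, see 3.2.3); the sketch uses
simple whitespace tokenization, which may differ from the tokenizer used to
produce the table above.

  import xml.etree.ElementTree as ET

  def average_title_length(xml_path):
      root = ET.parse(xml_path).getroot()
      # Hypothetical tag names -- verify against the files in ./queries/.
      titles = [q.findtext("title", default="") for q in root.iter("query")]
      lengths = [len(t.split()) for t in titles]
      return sum(lengths) / len(lengths) if lengths else 0.0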
3.3 Qrels

See directory ./qrels/.

3.3.1 Description

The new relevance assessments are provided in the qrels files
qrels.CUNI2017-eHealth.[train,test].[bin,graded].txt. The complete dual
assessment can be found in qrels.CUNI2017-eHealth.all-dual1.graded.txt and
qrels.CUNI2017-eHealth.all-dual2.graded.txt.

The relevance grades in 2013 and 2014 have 4 degrees: 0 (irrelevant),
1 (somewhat relevant), 2 (relevant) and 3 (highly relevant). When mapping
these grades into the binary files, [0,1] were mapped to 0 and [2,3] were
mapped to 1. In 2015, three levels were used: 0 (irrelevant), 1 (somewhat
relevant) and 2 (relevant); for binarization, 0 was mapped to 0 and [1,2]
were mapped to 1. We used three levels of relevance as in 2015 and followed
the same mapping: 0 -> 0, [1,2] -> 1.

3.3.2 Data format

The relevance assessments in the qrels files are represented in TREC format
(http://trec.nist.gov/), where each line is formatted as follows:

  QUERY_ID ITERATION DOCUMENT_ID RELEVANCY

QUERY_ID refers to the query ID, ITERATION (feedback iteration) is not used
here and is always 0, DOCUMENT_ID refers to a retrieved document ID, and
RELEVANCY is the relevance degree: [0,1] in the binary qrels files and
[0,1,2,3] in the graded ones.

3.3.3 Files

The ./qrels directory contains the qrels files provided by the CLEF eHealth
2013--2015 organizers. Note that we changed the query IDs in these files to
harmonize them with each other, so that all query IDs start with
clef201[3,4,5].[train,test].ID.

3.3.4 Statistics

The following table shows statistics of the assessment information: the
average number of assessed documents per query, the number of relevant
documents, and the number of irrelevant ones.

----------------------------------------------------------------------------
QREL_FILE                                  AVG_DOC/Query  Relevant  Irrelevant
----------------------------------------------------------------------------
qrels.CLEF2013-eHealth.test.bin.txt              97          1174       3676
qrels.CLEF2013-eHealth.train.bin.txt             31            46        110
qrels.CLEF2014-eHealth.test.bin.txt             136          3209       3591
qrels.CLEF2014-eHealth.train.bin.txt             42           134         80
qrels.CLEF2015-eHealth.test.bin.txt             183          2515       9576
qrels.CLEF2015-eHealth.train.bin.txt             32            15        147
qrels.COMPLETE-eHealth.all.bin.txt              229          9415      28694
qrels.COMPLETE-eHealth.test.bin.txt             233          3667      11740
qrels.COMPLETE-eHealth.train.bin.txt            227          5748      16954
qrels.CUNI2017-eHealth.all.bin.txt               86          2517      11851
qrels.CUNI2017-eHealth.all-dual1.bin.txt          4           358        466
qrels.CUNI2017-eHealth.all-dual2.bin.txt         40           238        442
qrels.CUNI2017-eHealth.test.bin.txt              92           944       5132
qrels.CUNI2017-eHealth.train.bin.txt             82          1573       6719
----------------------------------------------------------------------------
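As an illustration, the following minimal sketch reads a graded qrels file
in the TREC format described in 3.3.2 and binarizes it with the CUNI2017
mapping stated above (0 -> 0, grades 1 and 2 -> 1):

  def binarize_qrels(graded_path, binary_path):
      with open(graded_path) as fin, open(binary_path, "w") as fout:
          for line in fin:
              # QUERY_ID ITERATION DOCUMENT_ID RELEVANCY
              qid, iteration, doc_id, grade = line.split()
              label = 0 if int(grade) == 0 else 1
              fout.write(f"{qid} {iteration} {doc_id} {label}\n")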
3.4 Machine-translated queries

See directory ./mt_queries/.

3.4.1 Description

Machine translations of the non-English queries into English, produced by
three systems (see 3.4.3).

3.4.2 Data format

The queries in ./mt_queries/*.xml are represented in XML format and UTF-8
encoding, while ./mt_queries/1000_best_list contains the verbose output of
Moses. This includes alignment information and the scores from the SMT
system: the phrase translation model, the target language model, the
reordering model and the word penalty, as well as a sentence score, which is
a log-linear combination of all of these scores. For more information about
the verbose output of the Moses SMT system, see
http://www.statmt.org/moses/?n=Moses.Tutorial#ntoc4.

3.4.3 Files

The ./mt_queries directory contains the translations of the non-English
queries into English produced by three MT systems:

a) KConnect SMT system [4] (translations from all languages into English)
b) OnlineA - Google Translate (2016 translations from Czech, French and
   German; 2017 translations from all languages)
c) OnlineB - Microsoft Bing Translator (2016 translations from Czech, French
   and German; 2017 translations from all languages)

The translated topics in ./mt_queries/*.xml contain translations of the
titles only, as given in the reference query files (a reference query is a
human-translated version of an original English query into each of the
languages). All of these translations are single-best translations (the
first-ranked translation in the MT system's output).
1000_best_list/[cs,de,es,fr,hu,pl,sv] contain the 1000 best translations for
each query (each query in a separate file).

3.5 Original data

The original data as provided for CLEF eHealth 2013--2015 is available in
directory ./original/.

4. Authors

Shadi Saleh and Pavel Pecina
Institute of Formal and Applied Linguistics
Faculty of Mathematics and Physics
Charles University, Prague, Czech Republic
{saleh,pecina}@ufal.mff.cuni.cz

5. Acknowledgments

The language resources presented in this package are distributed by the
LINDAT/CLARIN project of the Ministry of Education of the Czech Republic.
This work was supported by the Czech Science Foundation (grant no.
P103/12/G084).

6. References

[1] Suominen, H., Salanterä, S., Velupillai, S., Chapman, W. W., Savova, G.,
Elhadad, N., et al. Overview of the ShARe/CLEF eHealth Evaluation Lab 2013.
In Information Access Evaluation. Multilinguality, Multimodality, and
Visualization, pp. 212-231, Springer, Berlin, Germany, 2013.

[2] Goeuriot, L., Kelly, L., Li, W., Palotti, J., Pecina, P., Zuccon, G.,
Hanbury, A., Jones, G., and Mueller, H. ShARe/CLEF eHealth Evaluation Lab
2014, Task 3: User-Centred Health Information Retrieval. In Proceedings of
CLEF 2014, CEUR-WS.org, Sheffield, England, 2014.

[3] Goeuriot, L., Kelly, L., Suominen, H., Hanlen, L., Névéol, A., Grouin,
C., Palotti, J., and Zuccon, G. Overview of the CLEF eHealth Evaluation Lab
2015. In International Conference of the Cross-Language Evaluation Forum for
European Languages, pp. 429-443, Springer, Berlin, Germany, 2015.

[4] Dušek, O., Hajič, J., Hlaváčová, J., Novák, M., Pecina, P., Rosa, R.,
Tamchyna, A., Urešová, Z., and Zeman, D. Machine Translation of Medical
Texts in the Khresmoi Project. In Proceedings of the Ninth Workshop on
Statistical Machine Translation, pp. 221-228, Baltimore, USA, 2014.

[5] Saleh, S., and Pecina, P. An Extended CLEF eHealth Test Collection for
Cross-Lingual Information Retrieval in the Medical Domain. In: Azzopardi, L.,
Stein, B., Fuhr, N., Mayr, P., Hauff, C., Hiemstra, D. (eds) Advances in
Information Retrieval. Lecture Notes in Computer Science, vol. 11438,
Springer, 2019.