Extended CLEF eHealth Test Collection for Cross-lingual Information Retrieval
in the Medical Domain

version 1.0 (April 2019)

1. Description

This package contains an extended version of the test collection used in the
CLEF eHealth Information Retrieval tasks in 2013--2015. Compared to the
original version, it provides complete query translations into Czech, French,
German, Hungarian, Polish, Spanish and Swedish, and additional relevance
assessments. This dataset is described in [5] and available from the
LINDAT/CLARIN repository: http://hdl.handle.net/11234/1-2925.

2. Preamble

2.1 Source

The data is adopted from the CLEF eHealth Information Retrieval tasks
2013--2015 (https://sites.google.com/site/clefehealth/) organized under the
CLEF initiative (http://clef-initiative.eu).

2.2 License

The original data (document collection, queries, relevance assessments) is
available under its original license (see
http://catalog.elra.info/product_info.php?products_id=1218 and
https://github.com/CLEFeHealth).

The newly added data is made available under the terms of the Creative
Commons Attribution-NonCommercial (CC-BY-NC) license, version 4.0
International. You may use it for academic research and all non-commercial
purposes as long as the authors (cf. Authors, below) are properly credited
and the sources acknowledged (cf. the Acknowledgments section). See
http://creativecommons.org/licenses/by-nc/4.0/ for a full description and
explanation of the licensing terms.

3. Data

This package contains the original English document collection (also
available from http://catalog.elra.info/product_info.php?products_id=1218),
the original queries in English and the relevance assessments (also available
from https://github.com/CLEFeHealth), human translations of the queries into
7 languages, and machine translations of those queries from the 7 languages
back into English.

3.1 Document collection

See directory ./collection/.

3.1.1 Description

The document collection is identical to the one used in the CLEF eHealth 2015
IR task. It includes 1,104,298 web pages in HTML that were crawled from
English medical and health-related websites [1].

NOTE: The CLEF eHealth 2015 IR document collection is a subset of the 2013
collection; some documents were removed in 2015. The file
./original/CLEFeHealth2013Task3/format-script-clefeHealth-task3/ehealth_task3/docids.ehealth
contains the IDs of the 2013 documents, while ./collection/document_ids.txt
contains the IDs of the 2015 documents.

3.1.2 Data format

The files in the collection have the .dat extension. Each file contains a set
of web pages, and each web page is stored as a record with the following
structure:

  #UID:attra0843_12_000001
  #DATE:201209
  #URL:http://www.attract.wales.nhs.uk/answer.aspx?criteria=&qid=1005&src=0
  #CONTENT:
  ...
  #EOR

#UID is a unique ID of the web page, #DATE is the date (in YYYYMM format)
when the page was fetched from the URL given in the #URL field, and the raw
HTML content of the page goes between #CONTENT and #EOR. Note that some pages
are binary files (e.g. PDF, DOCX, PPT).

3.1.3 Files

The document collection is available in ./collection/.

All valid document IDs are listed in the file ./collection/document_ids.txt.
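The following is a minimal sketch (not part of the distribution) of how the
records in a .dat file could be split according to the format described in
3.1.2. It reads the file as bytes, since some documents are binary (e.g. PDF):

  def read_records(path):
      with open(path, "rb") as f:
          data = f.read()
      # Records are terminated by #EOR; this naive split could break if the
      # byte sequence "#EOR" ever occurs inside binary content.
      for chunk in data.split(b"#EOR"):
          chunk = chunk.strip()
          if not chunk:
              continue
          header, _, content = chunk.partition(b"#CONTENT:")
          record = {"content": content.lstrip(b"\r\n")}
          for line in header.splitlines():
              for key in (b"#UID:", b"#DATE:", b"#URL:"):
                  if line.startswith(key):
                      field = key[1:-1].decode().lower()  # uid / date / url
                      record[field] = line[len(key):].strip().decode(
                          "utf-8", "replace")
          yield record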
3.2 Queries

See directory ./queries/.

3.2.1 Description

The queries in this collection are based on the official queries from the
CLEF eHealth Information Retrieval tasks 2013 [1], 2014 [2] and 2015 [3].

The queries in the 2013 and 2014 ShARe/CLEF eHealth IR tasks include medical
terms due to the way they were extracted from discharge summaries. In the
2015 CLEF eHealth IR task (Task 2: User-Centred Health Information
Retrieval), the queries simulate the way lay people express their medical
information needs, i.e. by describing their symptoms without any knowledge
of the underlying disease.

The test queries from 2013, 2014 and 2015 were merged and split randomly
into two sets: 100 queries intended for training and 66 queries for testing.
The split preserves the distribution of the year of origin, the ratio of
relevant/irrelevant documents, and the length of the queries. This split is
referred to as CUNI2017.

3.2.2 Data format

The queries in this package are provided in XML format (UTF-8 encoded) and
contain the original content plus translations into Czech (cs), German (de),
English (en), Spanish (es), French (fr), Hungarian (hu), Polish (pl) and
Swedish (sv), where applicable. See details in [3].

3.2.3 Files

The queries are available in ./queries/ in several forms:

a) Queries as released by the organizers of the CLEF eHealth IR tasks, named
   topics.CLEF201x-eHealth.[train,test].[LANG ID].xml. We modified the tags
   in these queries so that all files use the same tag names (in the original
   data, the topics.CLEF2015-eHealth.* files used different tag names than
   the 2013 and 2014 files).

b) Queries translated by us (by medical experts whom we asked to do so):
   - topics.CLEF2013-eHealth.[test].[cs,de,fr,es,hu,pl,sv]
   - topics.CLEF2014-eHealth.[test].[es,hu,pl,sv]
   - topics.CLEF2015-eHealth.[test].[es,hu,pl,sv]

c) Queries according to our split:
   topics.CUNI2017-eHealth.[train,test].[all languages].xml
   These queries contain only the titles and their translations into the
   other languages.

3.2.4 Statistics

The following table shows statistics of the query test sets: the average
query length (in number of tokens) for each language in each set, where 2013
refers to the test set of 50 queries from ShARe/CLEF eHealth 2013 (Task 3),
2014 to the test set of 50 queries from ShARe/CLEF eHealth 2014 (Task 3),
and 2015 to the test set of 66 queries from CLEF eHealth 2015 (Task 2).

-----------------------------------------------------------------
Lang/Set   2013   2014   2015   CUNI2017-Train   CUNI2017-Test
-----------------------------------------------------------------
EN         5.02   5.18   5.48        4.32             4.15
CS         5.02   5.32   5.44        5.28             5.27
DE         4.26   4.72   5.29        4.85             4.74
ES         5.18   5.86   6.41        5.90             5.83
FR         5.28   5.98   6.45        6.07             5.78
HU         4.54   4.76   5.14        4.88             4.78
PL         5.38   5.54   5.70        5.53             5.59
SV         4.38   4.42   5.29        4.82             4.65
-----------------------------------------------------------------
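A minimal sketch of computing such an average title length from one of the
query files is shown below. The tag names "query" and "title" are
assumptions made for illustration and should be checked against the actual
files (all of which share one unified tag set, see 3.2.3); the sketch uses
simple whitespace tokenization, which may differ from the tokenizer used to
produce the table above.

  import xml.etree.ElementTree as ET

  def average_title_length(xml_path):
      root = ET.parse(xml_path).getroot()
      # Hypothetical tag names -- verify against the files in ./queries/.
      titles = [q.findtext("title", default="") for q in root.iter("query")]
      lengths = [len(t.split()) for t in titles]
      return sum(lengths) / len(lengths) if lengths else 0.0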
3.3 Qrels

See directory ./qrels/.

3.3.1 Description

The new relevance assessments are provided in the qrels files
qrels.CUNI2017-eHealth.[train,test].[bin,graded].txt. The complete dual
assessment can be found in qrels.CUNI2017-eHealth.all-dual1.graded.txt and
qrels.CUNI2017-eHealth.all-dual2.graded.txt.

The relevance grades in 2013 and 2014 have 4 degrees: 0 (irrelevant),
1 (somewhat relevant), 2 (relevant) and 3 (highly relevant). When mapping
these grades into the binary files, [0,1] were mapped to 0 and [2,3] were
mapped to 1. In 2015, three levels were used: 0 (irrelevant), 1 (somewhat
relevant) and 2 (relevant); for binarization, 0 was mapped to 0 and [1,2]
were mapped to 1. We used three levels of relevance as in 2015 and followed
the same mapping: 0 -> 0, [1,2] -> 1.

3.3.2 Data format

The relevance assessments in the qrels files are represented in TREC format
(http://trec.nist.gov/), where each line is formatted as follows:

  QUERY_ID ITERATION DOCUMENT_ID RELEVANCY

QUERY_ID refers to the query ID, ITERATION (feedback iteration) is not used
here and is always 0, DOCUMENT_ID refers to a retrieved document ID, and
RELEVANCY is the relevance degree: [0,1] in the binary qrels files and
[0,1,2,3] in the graded ones.

3.3.3 Files

The ./qrels directory contains the qrels files provided by the CLEF eHealth
2013--2015 organizers. Note that we changed the query IDs in these files to
harmonize them with each other, so that all query IDs start with
clef201[3,4,5].[train,test].ID.

3.3.4 Statistics

The following table shows statistics of the assessment information: the
average number of assessed documents per query, the number of relevant
documents, and the number of irrelevant ones.

----------------------------------------------------------------------------
QREL_FILE                                  AVG_DOC/Query  Relevant  Irrelevant
----------------------------------------------------------------------------
qrels.CLEF2013-eHealth.test.bin.txt              97          1174       3676
qrels.CLEF2013-eHealth.train.bin.txt             31            46        110
qrels.CLEF2014-eHealth.test.bin.txt             136          3209       3591
qrels.CLEF2014-eHealth.train.bin.txt             42           134         80
qrels.CLEF2015-eHealth.test.bin.txt             183          2515       9576
qrels.CLEF2015-eHealth.train.bin.txt             32            15        147
qrels.COMPLETE-eHealth.all.bin.txt              229          9415      28694
qrels.COMPLETE-eHealth.test.bin.txt             233          3667      11740
qrels.COMPLETE-eHealth.train.bin.txt            227          5748      16954
qrels.CUNI2017-eHealth.all.bin.txt               86          2517      11851
qrels.CUNI2017-eHealth.all-dual1.bin.txt          4           358        466
qrels.CUNI2017-eHealth.all-dual2.bin.txt         40           238        442
qrels.CUNI2017-eHealth.test.bin.txt              92           944       5132
qrels.CUNI2017-eHealth.train.bin.txt             82          1573       6719
----------------------------------------------------------------------------
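As an illustration, the following minimal sketch reads a graded qrels file
in the TREC format described in 3.3.2 and binarizes it with the CUNI2017
mapping stated above (0 -> 0, grades 1 and 2 -> 1):

  def binarize_qrels(graded_path, binary_path):
      with open(graded_path) as fin, open(binary_path, "w") as fout:
          for line in fin:
              # QUERY_ID ITERATION DOCUMENT_ID RELEVANCY
              qid, iteration, doc_id, grade = line.split()
              label = 0 if int(grade) == 0 else 1
              fout.write(f"{qid} {iteration} {doc_id} {label}\n")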
3.4 Machine-translated queries

See directory ./mt_queries/.

3.4.1 Description

Machine translations of the non-English queries into English, produced by
three systems (see 3.4.3).

3.4.2 Data format

The queries in ./mt_queries/*.xml are represented in XML format and UTF-8
encoding, while ./mt_queries/1000_best_list contains the verbose output of
Moses. This includes alignment information and the scores from the SMT
system: the phrase translation model, the target language model, the
reordering model and the word penalty, as well as a sentence score, which is
a log-linear combination of all of these scores. For more information about
the verbose output of the Moses SMT system, see
http://www.statmt.org/moses/?n=Moses.Tutorial#ntoc4.

3.4.3 Files

The ./mt_queries directory contains the translations of the non-English
queries into English produced by three MT systems:

a) KConnect SMT system [4] (translations from all languages into English)
b) OnlineA - Google Translate (2016 translations from Czech, French and
   German; 2017 translations from all languages)
c) OnlineB - Microsoft Bing Translator (2016 translations from Czech, French
   and German; 2017 translations from all languages)

The translated topics in ./mt_queries/*.xml contain translations of the
titles only, as given in the reference query files (a reference query is a
human-translated version of an original English query into each of the
languages). All of these translations are single-best translations (the
first-ranked translation in the MT system's output).
1000_best_list/[cs,de,es,fr,hu,pl,sv] contain the 1000 best translations for
each query (each query in a separate file).

3.5 Original data

The original data as provided for CLEF eHealth 2013--2015 is available in
directory ./original/.

4. Authors

Shadi Saleh and Pavel Pecina
Institute of Formal and Applied Linguistics
Faculty of Mathematics and Physics
Charles University, Prague, Czech Republic
{saleh,pecina}@ufal.mff.cuni.cz

5. Acknowledgments

The language resources presented in this package are distributed by the
LINDAT/CLARIN project of the Ministry of Education of the Czech Republic.
This work was supported by the Czech Science Foundation (grant no.
P103/12/G084).

6. References

[1] Suominen, H., Salanterä, S., Velupillai, S., Chapman, W. W., Savova, G.,
Elhadad, N., et al. Overview of the ShARe/CLEF eHealth Evaluation Lab 2013.
In Information Access Evaluation. Multilinguality, Multimodality, and
Visualization, pp. 212-231, Springer, Berlin, Germany, 2013.

[2] Goeuriot, L., Kelly, L., Li, W., Palotti, J., Pecina, P., Zuccon, G.,
Hanbury, A., Jones, G., and Mueller, H. ShARe/CLEF eHealth Evaluation Lab
2014, Task 3: User-Centred Health Information Retrieval. In Proceedings of
CLEF 2014, CEUR-WS.org, Sheffield, England, 2014.

[3] Goeuriot, L., Kelly, L., Suominen, H., Hanlen, L., Névéol, A., Grouin,
C., Palotti, J., and Zuccon, G. Overview of the CLEF eHealth Evaluation Lab
2015. In International Conference of the Cross-Language Evaluation Forum for
European Languages, pp. 429-443, Springer, Berlin, Germany, 2015.

[4] Dušek, O., Hajič, J., Hlaváčová, J., Novák, M., Pecina, P., Rosa, R.,
Tamchyna, A., Urešová, Z., and Zeman, D. Machine Translation of Medical
Texts in the Khresmoi Project. In Proceedings of the Ninth Workshop on
Statistical Machine Translation, pp. 221-228, Baltimore, USA, 2014.

[5] Saleh, S., and Pecina, P. An Extended CLEF eHealth Test Collection for
Cross-Lingual Information Retrieval in the Medical Domain. In: Azzopardi, L.,
Stein, B., Fuhr, N., Mayr, P., Hauff, C., Hiemstra, D. (eds) Advances in
Information Retrieval. Lecture Notes in Computer Science, vol. 11438,
Springer, 2019.