This entry contains the SumeCzech dataset and the RougeRAW metric used for evaluation. Both the dataset and the metric are described in the paper "SumeCzech: Large Czech News-Based Summarization Dataset" by Milan Straka et al.
The dataset is distributed as a set of Python scripts which download the raw HTML pages from CommonCrawl and then process them into the required format.
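For orientation, the sketch below shows the general kind of n-gram overlap score that a ROUGE-style metric reports. It is not the RougeRAW implementation distributed in this entry. It assumes whitespace tokenization and no stemming, stopword removal, or synonym matching, and the function names `ngrams` and `overlap_scores` are hypothetical, introduced only for this illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_scores(reference, candidate, n=1):
    """Return (precision, recall, F1) of n-gram overlap on raw tokens.

    Assumption: tokens are obtained by splitting on whitespace and are
    compared exactly as written (no stemming or normalization).
    """
    ref = ngrams(reference.split(), n)
    cand = ngrams(candidate.split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    matched = sum((ref & cand).values())
    precision = matched / sum(cand.values()) if cand else 0.0
    recall = matched / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    p, r, f = overlap_scores("a b c d", "a b x", n=1)
    print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

For results comparable to the paper, use the RougeRAW implementation shipped with this entry rather than this sketch.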
The MPL 2.0 license applies to the scripts downloading the dataset and to the RougeRAW implementation.
THE LINDAT/CLARIN PROJECT (LM2015071 and CZ.02.1.01/0.0/0.0/16_013/0001781; formerly LM2010013) IS FULLY SUPPORTED BY THE MINISTRY OF EDUCATION, SPORTS AND YOUTH OF THE CZECH REPUBLIC UNDER THE PROGRAMME LM OF "LARGE INFRASTRUCTURES".
Copyright (c) 2019 UFAL MFF UK. All rights reserved.