This entry contains the SumeCzech dataset and the RougeRAW metric used for evaluation. Both the dataset and the metric are described in the paper "SumeCzech: Large Czech News-Based Summarization Dataset" by Milan Straka et al.
The dataset is distributed as a set of Python scripts which download the raw HTML pages from CommonCrawl and then process them into the required format.
The MPL 2.0 license applies to the scripts downloading the dataset and to the RougeRAW implementation.
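To give a rough idea of what a RougeRAW-style score measures, the following is a minimal sketch and not the distributed implementation. It assumes that the metric is an n-gram overlap between a system summary and a reference summary, computed on raw tokens without stemming or other language-specific preprocessing, and reported as precision, recall and F1. The function names `ngram_counts` and `rouge_n_raw` and the whitespace tokenization are illustrative assumptions; the actual RougeRAW implementation may tokenize and aggregate differently.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_raw(system, reference, n=1):
    """Illustrative n-gram overlap score on raw, whitespace-split tokens.

    Hypothetical helper: no stemming, stop-word removal or synonym
    matching is applied. Returns (precision, recall, f1).
    """
    sys_counts = ngram_counts(system.split(), n)
    ref_counts = ngram_counts(reference.split(), n)

    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((sys_counts & ref_counts).values())
    sys_total = sum(sys_counts.values())
    ref_total = sum(ref_counts.values())

    precision = overlap / sys_total if sys_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


if __name__ == "__main__":
    p, r, f = rouge_n_raw("vláda schválila nový zákon",
                          "vláda schválila rozpočet")
    print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")
```

In this example two of the four system unigrams also occur in the three-token reference, so precision is 2/4 and recall is 2/3. For reported results, use the RougeRAW implementation shipped with this entry rather than this sketch.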
THE LINDAT/CLARIAH-CZ PROJECT (LM2018101; which is a direct legal successor of the LINDAT/CLARIN projects LM2010013 and LM2015071) IS FULLY SUPPORTED BY THE MINISTRY OF EDUCATION, SPORTS AND YOUTH OF THE CZECH REPUBLIC UNDER THE PROGRAMME LM OF "LARGE INFRASTRUCTURES".
Copyright (c) 2020 UFAL MFF UK. All rights reserved.