This entry contains the SumeCzech dataset and the metric RougeRAW used for evaluation. Both the dataset and the metric are described in the paper "SumeCzech: Large Czech News-Based Summarization Dataset" by Milan Straka et al.
The dataset is distributed as a set of Python scripts which download the raw HTML pages from CommonCrawl and then process them into the required format.
The MPL 2.0 license applies to the scripts downloading the dataset and to the RougeRAW implementation.
Note: sumeczech-1.0-update-230225.zip is the updated release of the SumeCzech download script, including the original RougeRAW evaluation metric. The download script was modified to use the updated CommonCrawl download URL and to support Python 3.10 and Python 3.11. The downloaded dataset itself is unchanged. The original archive sumeczech-1.0.zip was renamed to sumeczech-1.0-obsolete-180213.zip and is kept for reference.
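For readers unfamiliar with ROUGE-style scoring, the following is a minimal sketch of an n-gram overlap score computed on raw tokens. It is not the RougeRAW implementation shipped in the archive: the function names `ngrams` and `rouge_n`, the whitespace tokenization, and the lowercasing are all assumptions made for illustration, and the distributed script should be used for any reported results.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(gold, system, n=1):
    """Return (precision, recall, F1) of n-gram overlap between two texts.

    Assumption: tokens are lowercased and split on whitespace, with no
    stemming, stop-word removal, or synonym matching.
    """
    gold_counts = ngrams(gold.lower().split(), n)
    system_counts = ngrams(system.lower().split(), n)

    # Clipped overlap: each n-gram counts at most as often as it occurs in both texts.
    overlap = sum((gold_counts & system_counts).values())
    gold_total = sum(gold_counts.values())
    system_total = sum(system_counts.values())

    precision = overlap / system_total if system_total else 0.0
    recall = overlap / gold_total if gold_total else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1
```

Calling `rouge_n(gold_summary, system_summary, n=2)` would give the bigram variant of the same score; both arguments are plain strings.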
dc.language.iso
ces
dc.publisher
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.relation.isreferencedby
https://www.aclweb.org/anthology/L18-1551.pdf
dc.rights
Mozilla Public License 2.0
dc.rights.uri
http://opensource.org/licenses/MPL-2.0
dc.subject
summarization
dc.subject
SumeCzech
dc.subject
Rouge
dc.title
SumeCzech
dc.type
corpus
metashare.ResourceInfo#ContentInfo.mediaType
text
dc.rights.label
PUB
has.files
yes
branding
LINDAT / CLARIAH-CZ
contact.person
Milan Straka straka@ufal.mff.cuni.cz Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
sponsor
Ministry of Education, Youth and Sports of the Czech Republic LM2015071 LINDAT/CLARIN: Institute for Analysis, Processing and Distribution of Linguistic Data nationalFunds
sponsor
Ministry of Education, Youth and Sports of the Czech Republic CZ.02.1.01/0.0/0.0/16_013/0001781 LINDAT/CLARIN - Research Infrastructure for Language Technologies - Extension of the Repository and Computing Capacity nationalFunds
sponsor
Charles University 8502/2016 GAUK 8502/2016 Other
sponsor
Charles University 1114217/2017 GAUK 1114217/2017 Other
sponsor
Charles University (outside GAUK) SVV 260 453 Specific University Research nationalFunds
Updated release of the SumeCzech download script, including the original RougeRAW evaluation metric. The download script was modified to use the updated CommonCrawl download URL and to support Python 3.10 and Python 3.11.
SumeCzech dataset and RougeRAW evaluation metric. NOTE that the download script in this archive **no longer works**, because it uses an obsolete CommonCrawl download URL and does not support Python 3.10+.