SumeCzech
Please use the following text to cite this item:
Straka, Milan; et al., 2018,
SumeCzech, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11234/1-2615.
Authors
Straka, Milan; et al.
Date issued
2018-02-13
Size
1001593 articles
Description
This entry contains the SumeCzech dataset and the metric RougeRAW used for evaluation. Both the dataset and the metric are described in the paper "SumeCzech: Large Czech News-Based Summarization Dataset" by Milan Straka et al.
The dataset is distributed as a set of Python scripts which download the raw HTML pages from CommonCrawl and then process them into the required format.
The MPL 2.0 license applies to the scripts downloading the dataset and to the RougeRAW implementation.
Note: sumeczech-1.0-update-230225.zip is the updated release of the SumeCzech download script, including the original RougeRAW evaluation metric. The download script was modified to use the updated CommonCrawl download URL and to support Python 3.10 and Python 3.11. The downloaded dataset itself is exactly the same as before. The original archive sumeczech-1.0.zip was renamed to sumeczech-1.0-obsolete-180213.zip and is kept for reference.
Acknowledgement
Ministry of Education, Youth and Sports of the Czech Republic
Project code: LM2015071
Project name: LINDAT/CLARIN: Institute for Analysis, Processing and Distribution of Linguistic Data
Ministry of Education, Youth and Sports of the Czech Republic
Project code: CZ.02.1.01/0.0/0.0/16_013/0001781
Project name: LINDAT/CLARIN - Research Infrastructure for Language Technologies - Extension of the Repository and Computing Capacity
Charles University
Project code: 8502/2016
Project name: GAUK 8502/2016
Charles University
Project code: 1114217/2017
Project name: GAUK 1114217/2017
Charles University (outside GAUK)
Project code: SVV 260 453
Project name: Specific University Research
Files in this item
- Name: sumeczech-1.0-obsolete-180213.zip
- Size: 59.27 MB
- Format: application/zip
- Description: SumeCzech dataset and RougeRAW evaluation metric. NOTE that the download script in this archive is **not working anymore**, because it uses an obsolete CommonCrawl download URL and does not support Python 3.10+.
- MD5: 832119c1236a5007b66d4b07676913b8

- Name: sumeczech-1.0-update-230225.zip
- Size: 59.27 MB
- Format: application/zip
- Description: Updated release of the SumeCzech download script, including the original RougeRAW evaluation metric. The download script was modified to use the updated CommonCrawl download URL and to support Python 3.10 and Python 3.11.
- MD5: 54e2c8215d8a5a4bc1733823b8e270f3
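
The MD5 checksums listed for the two archives can be used to check that a download completed without corruption. The following is a minimal sketch of such a check using only the Python standard library; it is not part of the SumeCzech scripts, and the helper names `md5_of_file` and `verify_archive` are hypothetical, chosen here for illustration. It assumes the archive was saved under its original file name.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of this item.
EXPECTED_MD5 = {
    "sumeczech-1.0-obsolete-180213.zip": "832119c1236a5007b66d4b07676913b8",
    "sumeczech-1.0-update-230225.zip": "54e2c8215d8a5a4bc1733823b8e270f3",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path):
    """Return True if the file's MD5 matches the checksum listed for its name.

    Hypothetical helper: looks the expected checksum up by file name and
    raises KeyError if the name is not one of the two listed archives.
    """
    name = Path(path).name
    if name not in EXPECTED_MD5:
        raise KeyError(f"no checksum listed for {name!r}")
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, `verify_archive("sumeczech-1.0-update-230225.zip")` should return `True` for an intact copy of the updated archive. MD5 is adequate here for detecting accidental corruption; it is not a guarantee against deliberate tampering.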


