
SumeCzech

Please use the following text to cite this item:
Straka, Milan; et al., 2018, SumeCzech, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11234/1-2615.
Date issued
2018-02-13
Size
1001593 articles
Language(s)
Description
This entry contains the SumeCzech dataset and RougeRAW, the metric used for evaluation. Both are described in the paper "SumeCzech: Large Czech News-Based Summarization Dataset" by Milan Straka et al. The dataset is distributed as a set of Python scripts that download the raw HTML pages from CommonCrawl and process them into the required format. The MPL 2.0 license applies to the download scripts and to the RougeRAW implementation.
Note: sumeczech-1.0-update-230225.zip is the updated release of the SumeCzech download script and includes the original RougeRAW evaluation metric. The download script was modified to use the updated CommonCrawl download URL and to support Python 3.10 and Python 3.11; the downloaded dataset itself is unchanged. The original archive, sumeczech-1.0.zip, was renamed to sumeczech-1.0-obsolete-180213.zip and is kept for reference.
Acknowledgement
This item is Publicly Available
and licensed under:
 Files in this item
Name
sumeczech-1.0-obsolete-180213.zip
Size
59.27 MB
Format
application/zip
Description
SumeCzech dataset and RougeRAW evaluation metric. NOTE: the download script in this archive **no longer works**, because it uses an obsolete CommonCrawl download URL and does not support Python 3.10+.
MD5
832119c1236a5007b66d4b07676913b8
Preview
  File Preview
    • downloader.py (4 kB)
    • LICENSE (16 kB)
    • README.md (4 kB)
    • downloader_extractor.py (5 kB)
    • sumeczech-1.0-index.jsonl.xz (59 MB)
    • rouge_raw.py (4 kB)
    • requirements.txt (55 B)
    • downloader_extractor_utils.py (12 kB)
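
The archive contains an index file, sumeczech-1.0-index.jsonl.xz. Its record layout is not documented on this page, so the following is only a minimal sketch for inspecting it, assuming (from the file extension alone) that it is an xz-compressed JSON Lines file with one JSON object per line. The function name iter_index_records is hypothetical and is not part of the distributed scripts.

```python
import json
import lzma


def iter_index_records(path, limit=None):
    """Yield parsed JSON objects from an xz-compressed JSON Lines file.

    Assumption: each non-empty line holds one JSON value. No field
    names are assumed; callers should inspect the records themselves.
    """
    yielded = 0
    with lzma.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            # Stop once the requested number of records has been produced.
            if limit is not None and yielded >= limit:
                return
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            yield json.loads(line)
            yielded += 1
```

Calling list(iter_index_records("sumeczech-1.0-index.jsonl.xz", limit=3)) would show the first few records without decompressing the whole 59 MB file into memory, which is a reasonable first step before relying on any particular field.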
Name
sumeczech-1.0-update-230225.zip
Size
59.27 MB
Format
application/zip
Description
Updated release of the SumeCzech download script, including the original RougeRAW evaluation metric. The download script was modified to use the updated CommonCrawl download URL and to support Python 3.10 and Python 3.11.
MD5
54e2c8215d8a5a4bc1733823b8e270f3
Preview
  File Preview
    • downloader.py (5 kB)
    • LICENSE (16 kB)
    • README.md (4 kB)
    • downloader_extractor.py (5 kB)
    • sumeczech-1.0-index.jsonl.xz (59 MB)
    • rouge_raw.py (4 kB)
    • requirements.txt (55 B)
    • downloader_extractor_utils.py (12 kB)
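
The MD5 values listed for the two archives can be used to check a download before unpacking it. The sketch below shows one way to do that with the Python standard library. The checksums are copied from this page; the helper names md5_of_file and verify_archive are hypothetical, and the sketch assumes the downloaded file keeps its original name.

```python
import hashlib
from pathlib import Path

# MD5 checksums as listed on this page, keyed by archive file name.
EXPECTED_MD5 = {
    "sumeczech-1.0-obsolete-180213.zip": "832119c1236a5007b66d4b07676913b8",
    "sumeczech-1.0-update-230225.zip": "54e2c8215d8a5a4bc1733823b8e270f3",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path):
    """Return True if the file's MD5 matches the value listed for its name.

    Raises KeyError if the file name is not one of the two archives above.
    """
    path = Path(path)
    return md5_of_file(path) == EXPECTED_MD5[path.name]
```

For example, verify_archive("sumeczech-1.0-update-230225.zip") returning False would indicate a corrupted or incomplete download. MD5 is suitable here only as an integrity check against transfer errors, not as protection against deliberate tampering.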