dc.contributor.author Straka, Milan
dc.contributor.author Mediankin, Nikita
dc.contributor.author Kocmi, Tom
dc.contributor.author Žabokrtský, Zdeněk
dc.contributor.author Hudeček, Vojtěch
dc.contributor.author Hajič, Jan
dc.date.accessioned 2020-01-10T09:44:46Z
dc.date.available 2020-01-10T09:44:46Z
dc.date.issued 2018-02-13
dc.identifier.uri http://hdl.handle.net/11234/1-2615
dc.description This entry contains the SumeCzech dataset and the RougeRAW metric used for evaluation. Both the dataset and the metric are described in the paper "SumeCzech: Large Czech News-Based Summarization Dataset" by Milan Straka et al. The dataset is distributed as a set of Python scripts which download the raw HTML pages from CommonCrawl and then process them into the required format. The MPL 2.0 license applies to the scripts downloading the dataset and to the RougeRAW implementation. Note: sumeczech-1.0-update-230225.zip is the updated release of the SumeCzech download script, including the original RougeRAW evaluation metric. The download script was modified to use the updated CommonCrawl download URL and to support Python 3.10 and Python 3.11. The downloaded dataset itself is unchanged. The original archive sumeczech-1.0.zip was renamed to sumeczech-1.0-obsolete-180213.zip and is kept for reference.
dc.language.iso ces
dc.publisher Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.relation.isreferencedby https://www.aclweb.org/anthology/L18-1551.pdf
dc.rights Mozilla Public License 2.0
dc.rights.uri http://opensource.org/licenses/MPL-2.0
dc.subject summarization
dc.subject SumeCzech
dc.subject Rouge
dc.title SumeCzech
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
dc.rights.label PUB
has.files yes
branding LINDAT / CLARIAH-CZ
contact.person Milan Straka straka@ufal.mff.cuni.cz Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
sponsor Ministry of Education, Youth and Sports of the Czech Republic LM2015071 LINDAT/CLARIN: Institute for Analysis, Processing and Distribution of Linguistic Data nationalFunds
sponsor Ministry of Education, Youth and Sports of the Czech Republic CZ.02.1.01/0.0/0.0/16_013/0001781 LINDAT/CLARIN - Research Infrastructure for Language Technologies - Extension of the Repository and Computing Capacity nationalFunds
sponsor Charles University 8502/2016 GAUK 8502/2016 Other
sponsor Charles University 1114217/2017 GAUK 1114217/2017 Other
sponsor Charles University (outside GAUK) SVV 260 453 Specific University Research nationalFunds
size.info 1001593 articles
files.size 124291485
files.count 2


Files in this item

This item is Publicly Available and licensed under the Mozilla Public License 2.0.
Total size of all files in item: 118.53 MB
Name: sumeczech-1.0-update-230225.zip
Size: 59.27 MB
Format: application/zip
Description: Updated release of the SumeCzech download script, including the original RougeRAW evaluation metric. The download script was modified to use the updated CommonCrawl download URL and to support Python 3.10 and Python 3.11.
MD5: 54e2c8215d8a5a4bc1733823b8e270f3
File preview:
    • downloader.py (5 kB)
    • LICENSE (16 kB)
    • README.md (4 kB)
    • sumeczech-1.0-index.jsonl.xz (59 MB)
    • rouge_raw.py (4 kB)
    • downloader_extractor.py (5 kB)
    • requirements.txt (55 B)
    • downloader_extractor_utils.py (12 kB)

Name: sumeczech-1.0-obsolete-180213.zip
Size: 59.27 MB
Format: application/zip
Description: SumeCzech dataset and RougeRAW evaluation metric. NOTE: the download script in this archive no longer works, because it uses an obsolete CommonCrawl download URL and does not support Python 3.10+.
MD5: 832119c1236a5007b66d4b07676913b8
File preview:
    • downloader.py (4 kB)
    • LICENSE (16 kB)
    • README.md (4 kB)
    • sumeczech-1.0-index.jsonl.xz (59 MB)
    • rouge_raw.py (4 kB)
    • downloader_extractor.py (5 kB)
    • requirements.txt (55 B)
    • downloader_extractor_utils.py (12 kB)
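
Since this record publishes an MD5 digest for each archive, a downloaded file can be checked against it before use. The following is a minimal sketch, assuming the archives were saved under their listed names in the current directory. The helper names md5_of_file and verify_archive are hypothetical and are not part of the SumeCzech scripts; only the standard-library hashlib module is used.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the file listing of this record.
PUBLISHED_MD5 = {
    "sumeczech-1.0-update-230225.zip": "54e2c8215d8a5a4bc1733823b8e270f3",
    "sumeczech-1.0-obsolete-180213.zip": "832119c1236a5007b66d4b07676913b8",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a large archive is never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path):
    """Compare a file's MD5 with the digest published for its file name.

    Hypothetical helper: raises KeyError if the file name is not one of
    the two archives listed in this record.
    """
    expected = PUBLISHED_MD5[Path(path).name]
    return md5_of_file(path) == expected


if __name__ == "__main__":
    # Assumption: archives sit in the current directory under their listed names.
    for name in PUBLISHED_MD5:
        if Path(name).exists():
            print(name, "OK" if verify_archive(name) else "MISMATCH")
        else:
            print(name, "not found, skipped")
```

A mismatch most likely indicates an incomplete or corrupted download. MD5 is suitable here only as an integrity check against transfer errors, not as protection against deliberate tampering.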
