dc.contributor.author Marek, Petr
dc.contributor.author Müller, Štěpán
dc.date.accessioned 2021-02-03T08:30:28Z
dc.date.available 2021-02-03T08:30:28Z
dc.date.issued 2021-01-30
dc.identifier.uri http://hdl.handle.net/11234/1-3505
dc.description SumeCzech-NER

SumeCzech-NER contains named entity annotations of SumeCzech 1.0 (Straka et al. 2018, SumeCzech: Large Czech News-Based Summarization Dataset).

Format

The dataset is split into four files in jsonl format, with one JSON object on each line. The most important fields of the JSON objects are:

- dataset: train, dev, test, oodtest
- ne_abstract: list of named entity annotations of the article's abstract
- ne_headline: list of named entity annotations of the article's headline
- ne_text: list of named entity annotations of the article's text
- url: the article's URL, which can be used to match an article across SumeCzech and SumeCzech-NER

Annotations

We used SpaCy's NER model trained on CoNLL-based extended CNEC 2.0. The model achieved a 78.45 F-Score on the dataset's testing set. The annotations are in IOB2 format. The entity types are: Numbers in addresses, Geographical names, Institutions, Media names, Artifact names, Personal names, and Time expressions.

Tokenization

We used the following Python code for tokenization:

    from typing import List

    from nltk.tokenize import word_tokenize


    def tokenize(text: str) -> List[str]:
        # Pad selected punctuation with spaces so it becomes separate tokens.
        for mark in ('.', ',', '?', '!', '-', '–', '/'):
            text = text.replace(mark, f' {mark} ')
        tokens = word_tokenize(text)
        return tokens
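As a rough illustration of how the fields described above might be consumed, the following sketch streams one part file line by line and groups IOB2 tags into entity spans. It assumes, without confirmation from this record, that each ne_* field is a flat list of IOB2 tag strings (for example B-P, I-P, O) aligned one-to-one with the tokens produced by the tokenization code above. The helper names iter_records and iob2_spans and the tag labels used in the comments are hypothetical.

```python
import json
from typing import Dict, Iterator, List, Tuple


def iter_records(path: str) -> Iterator[Dict]:
    """Yield one JSON object per non-empty line of a jsonl file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def iob2_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Group IOB2 tags into (start, end_exclusive, entity_type) spans.

    ASSUMPTION: tags look like "B-<type>", "I-<type>" or "O".
    An I- tag that does not continue the current entity is ignored.
    """
    spans: List[Tuple[int, int, str]] = []
    start = None
    current = None
    for i, tag in enumerate(tags):
        # An I- tag of the same type extends the open entity.
        if tag.startswith("I-") and current == tag[2:]:
            continue
        # Anything else closes the open entity, if there is one.
        if current is not None:
            spans.append((start, i, current))
            start, current = None, None
        # A B- tag opens a new entity.
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    if current is not None:
        spans.append((start, len(tags), current))
    return spans
```

Reading the files lazily keeps memory use flat even though each part is close to 1 GB. If the ne_* fields turn out to have a different shape, only iob2_spans needs to change.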
dc.language.iso ces
dc.publisher Czech Technical University in Prague
dc.rights Mozilla Public License 2.0
dc.rights.uri http://opensource.org/licenses/MPL-2.0
dc.subject SumeCzech
dc.subject named entity recognition
dc.subject named entity corpus
dc.subject summarization
dc.title SumeCzech-NER
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
dc.rights.label PUB
has.files yes
branding LRT + Open Submissions
contact.person Petr Marek marekp17@fel.cvut.cz Czech Technical University in Prague
size.info 1000000 articles
size.info 3 GB
files.size 3194673292
files.count 4


 Files in this item

This item is Publicly Available and licensed under: Mozilla Public License 2.0

Name                       Size       Format   Description  MD5
sumeczech-1.0-ner-0.jsonl  914.9 MB   Unknown  Part 0       2854d4cdcfe412a4096966058974ab2a
sumeczech-1.0-ner-1.jsonl  915.31 MB  Unknown  Part 1       c42fceef132222c193c11699e1da0c1f
sumeczech-1.0-ner-2.jsonl  914.44 MB  Unknown  Part 2       ee32af1e21886dfc3378c7a57d1e790e
sumeczech-1.0-ner-3.jsonl  302.03 MB  Unknown  Part 3       f1edf01f70a23253fe6cf9012a44469c
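Because each part is large, it may be worth checking a download against the MD5 values listed for the four files. The sketch below is one possible way to do that with the Python standard library. The file names and checksums are copied from this record, while the function names md5_of_file and verify_downloads and the assumption that all parts sit in a single local directory are illustrative only.

```python
import hashlib
import os
from typing import Dict

# File names and MD5 checksums as listed in this record.
EXPECTED_MD5 = {
    "sumeczech-1.0-ner-0.jsonl": "2854d4cdcfe412a4096966058974ab2a",
    "sumeczech-1.0-ner-1.jsonl": "c42fceef132222c193c11699e1da0c1f",
    "sumeczech-1.0-ner-2.jsonl": "ee32af1e21886dfc3378c7a57d1e790e",
    "sumeczech-1.0-ner-3.jsonl": "f1edf01f70a23253fe6cf9012a44469c",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory: str) -> Dict[str, bool]:
    """Map each expected file name to True if it exists and its MD5 matches.

    ASSUMPTION: all parts were saved under `directory` with their original names.
    """
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = os.path.join(directory, name)
        results[name] = os.path.isfile(path) and md5_of_file(path) == expected
    return results
```

A missing file and a corrupted file are both reported as False here; callers that need to tell them apart could check os.path.isfile separately.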
