SumeCzech-NER

Name: SumeCzech-NER
License: http://opensource.org/licenses/MPL-2.0

Marek, Petr; Müller, Štěpán

dc.contributor.author	Marek, Petr
dc.contributor.author	Müller, Štěpán
dc.date.accessioned	2021-02-03T08:30:28Z
dc.date.available	2021-02-03T08:30:28Z
dc.date.issued	2021-01-30
dc.identifier.uri	http://hdl.handle.net/11234/1-3505
dc.description	SumeCzech-NER

SumeCzech-NER contains named entity annotations of SumeCzech 1.0 (Straka et al. 2018, SumeCzech: Large Czech News-Based Summarization Dataset).

Format

The dataset is split into four files in jsonl format, with one JSON object on each line. The most important fields of the JSON objects are:
- dataset: train, dev, test, oodtest
- ne_abstract: list of named entity annotations of the article's abstract
- ne_headline: list of named entity annotations of the article's headline
- ne_text: list of named entity annotations of the article's text
- url: the article's URL, which can be used to match articles across SumeCzech and SumeCzech-NER

Annotations

We used SpaCy's NER model trained on CoNLL-based extended CNEC 2.0. The model achieved an F-score of 78.45 on that dataset's test set. The annotations are in IOB2 format. The entity types are: Numbers in addresses, Geographical names, Institutions, Media names, Artifact names, Personal names, and Time expressions.

Tokenization

We used the following Python code for tokenization:

    from typing import List

    from nltk.tokenize import word_tokenize


    def tokenize(text: str) -> List[str]:
        for mark in ('.', ',', '?', '!', '-', '–', '/'):
            text = text.replace(mark, f' {mark} ')
        tokens = word_tokenize(text)
        return tokens
dc.language.iso	ces
dc.publisher	Czech Technical University in Prague
dc.rights	Mozilla Public License 2.0
dc.rights.uri	http://opensource.org/licenses/MPL-2.0
dc.subject	SumeCzech
dc.subject	named entity recognition
dc.subject	named entity corpus
dc.subject	summarization
dc.title	SumeCzech-NER
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
dc.rights.label	PUB
has.files	yes
branding	LRT + Open Submissions
contact.person	Petr Marek marekp17@fel.cvut.cz Czech Technical University in Prague
size.info	1000000 articles
size.info	3 GB
files.size	3194673292
files.count	4
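
The description field states that each file holds one JSON object per line and that the annotations use IOB2. The following is a minimal sketch of how such records might be read and how IOB2 tags can be grouped into entity spans. It assumes, without confirmation from the record, that each ne_* field is a list of tag strings such as "B-X", "I-X" and "O", one per token. The helper names read_jsonl and iob2_spans and the labels "P" and "G" are hypothetical, chosen only for illustration.

```python
import json
from typing import Iterable, Iterator, List, Tuple


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def iob2_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group IOB2 tags into (entity_type, start, end) spans, end exclusive.

    Assumes tags look like "B-X", "I-X" or "O" (an assumption about the
    dataset's encoding). An "I-" tag that does not continue an open span
    of the same type is ignored, i.e. treated like "O".
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # A matching I- tag simply extends the currently open span.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the open span, if there is one.
        if label is not None:
            spans.append((label, start, i))
            start, label = None, None
        # A B- tag opens a new span.
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    # Close a span that runs to the end of the sequence.
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Illustrative in-memory record; real records come from the dataset files.
sample_lines = [
    '{"dataset": "dev", "url": "http://example.com/a", '
    '"ne_headline": ["B-P", "I-P", "O", "B-G"]}\n',
]
records = list(read_jsonl(sample_lines))
headline_spans = iob2_spans(records[0]["ne_headline"])
```

To process a real file, an open file handle can be passed to read_jsonl in place of sample_lines, since iterating over a text file yields its lines. The url field of each record can then serve as the key for joining with the corresponding SumeCzech article.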