Show simple item record
dc.contributor.author |
Marek, Petr |
dc.contributor.author |
Müller, Štěpán |
dc.date.accessioned |
2021-02-03T08:30:28Z |
dc.date.available |
2021-02-03T08:30:28Z |
dc.date.issued |
2021-01-30 |
dc.identifier.uri |
http://hdl.handle.net/11234/1-3505 |
dc.description |
SumeCzech-NER
SumeCzech-NER contains named entity annotations of SumeCzech 1.0 (Straka et al. 2018, SumeCzech: Large Czech News-Based Summarization Dataset).
Format
The dataset is split into four files in jsonl format, with one JSON object on each line. The most important fields of each JSON object are:
- dataset: train, dev, test, oodtest
- ne_abstract: list of named entity annotations of article's abstract
- ne_headline: list of named entity annotations of article's headline
- ne_text: list of named entity annotations of article's text
- url: article's URL, which can be used to match articles across SumeCzech and SumeCzech-NER
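As a minimal sketch of how these fields might be consumed, the following Python reads jsonl input one object per line and indexes the records by url, so that they can be matched against SumeCzech 1.0. The helper names iter_records and index_by_url are hypothetical and not part of the dataset. Only the field names listed above are taken from the description, and nothing is assumed about the internal layout of the ne_* lists.

```python
import json
from typing import Dict, Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line (jsonl)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def index_by_url(records: Iterable[dict]) -> Dict[str, dict]:
    """Map each article URL to its record (hypothetical helper)."""
    return {record["url"]: record for record in records}


# Possible usage with one of the four files from this record:
# with open("sumeczech-1.0-ner-0.jsonl", encoding="utf-8") as handle:
#     ner_by_url = index_by_url(iter_records(handle))
```

Iterating line by line keeps memory use low while parsing, which matters for files of roughly 900 MB each. The resulting index still holds every record, so for large parts it may be preferable to store only the fields that are needed.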
Annotations
We used a spaCy NER model trained on the CoNLL-based extended CNEC 2.0. The model achieved an F-score of 78.45 on the dataset's test set. The annotations are in IOB2 format. The entity types are: Numbers in addresses, Geographical names, Institutions, Media names, Artifact names, Personal names, and Time expressions.
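To illustrate the IOB2 scheme, the sketch below groups a sequence of per-token tags into entity spans. It assumes the usual IOB2 convention, in which B-X opens an entity of type X, I-X continues it, and O marks a token outside any entity. The function name iob2_to_spans and the labels P and G in the usage example are placeholders: this record does not state the exact label strings stored in the files.

```python
from typing import List, Tuple


def iob2_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert IOB2 tags to (entity_type, start, end) spans, end exclusive."""
    spans: List[Tuple[str, int, int]] = []
    start = None    # index of the first token of the open entity
    current = None  # type of the open entity, or None if none is open
    for i, tag in enumerate(tags):
        # An I- tag of the same type simply extends the open entity.
        if tag.startswith("I-") and current == tag[2:]:
            continue
        # Any other tag closes the open entity, if there is one.
        if current is not None:
            spans.append((current, start, i))
            start, current = None, None
        # A B- tag opens a new entity; O and stray I- tags are treated as outside.
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


# Placeholder labels: "P" and "G" are illustrative only.
example_tags = ["B-P", "I-P", "O", "B-G"]
example_spans = iob2_to_spans(example_tags)
```

Token indices in the returned spans refer to positions in the tag list, so they can be used to slice a parallel token list of the same length.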
Tokenization
We used the following Python code for tokenization:
from typing import List
from nltk.tokenize import word_tokenize

def tokenize(text: str) -> List[str]:
    for mark in ('.', ',', '?', '!', '-', '–', '/'):
        text = text.replace(mark, f' {mark} ')
    tokens = word_tokenize(text)
    return tokens |
dc.language.iso |
ces |
dc.publisher |
Czech Technical University in Prague |
dc.rights |
Mozilla Public License 2.0 |
dc.rights.uri |
http://opensource.org/licenses/MPL-2.0 |
dc.subject |
SumeCzech |
dc.subject |
named entity recognition |
dc.subject |
named entity corpus |
dc.subject |
summarization |
dc.title |
SumeCzech-NER |
dc.type |
corpus |
metashare.ResourceInfo#ContentInfo.mediaType |
text |
dc.rights.label |
PUB |
has.files |
yes |
branding |
LRT + Open Submissions |
contact.person |
Petr Marek marekp17@fel.cvut.cz Czech Technical University in Prague |
size.info |
1000000 articles |
size.info |
3 GB |
files.size |
3194673292 |
files.count |
4 |
Files in this item
License category:
Publicly Available
License:
Mozilla Public License 2.0
- Name: sumeczech-1.0-ner-0.jsonl
- Size: 914.9 MB
- Format: Unknown
- Description: Part 0
- MD5: 2854d4cdcfe412a4096966058974ab2a

- Name: sumeczech-1.0-ner-1.jsonl
- Size: 915.31 MB
- Format: Unknown
- Description: Part 1
- MD5: c42fceef132222c193c11699e1da0c1f

- Name: sumeczech-1.0-ner-2.jsonl
- Size: 914.44 MB
- Format: Unknown
- Description: Part 2
- MD5: ee32af1e21886dfc3378c7a57d1e790e

- Name: sumeczech-1.0-ner-3.jsonl
- Size: 302.03 MB
- Format: Unknown
- Description: Part 3
- MD5: f1edf01f70a23253fe6cf9012a44469c
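
The MD5 checksums listed for the files can be used to check a download. A minimal sketch, assuming the files have been saved locally under the names shown in this record. The function name md5_of_file is hypothetical, and the file is read in chunks so that the roughly 900 MB parts do not have to fit in memory.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Possible usage, comparing against the checksum listed for Part 0:
# expected = "2854d4cdcfe412a4096966058974ab2a"
# assert md5_of_file("sumeczech-1.0-ner-0.jsonl") == expected
```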