OAGK Keyword Generation Dataset

Name: OAGK Keyword Generation Dataset
License: http://creativecommons.org/licenses/by/4.0/

Çano, Erion

dc.contributor.author	Çano, Erion
dc.date.accessioned	2019-03-08T12:46:46Z
dc.date.available	2019-03-08T12:46:46Z
dc.date.issued	2019-04
dc.identifier.uri	http://hdl.handle.net/11234/1-2943
dc.description	OAGK is a keyword extraction/generation dataset consisting of 2.2 million abstracts, titles and keyword strings from scientific articles. Texts were lowercased and tokenized with the Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file. This data is derived from the OAG data collection (https://aminer.org/open-academic-graph), which was released under the ODC-BY licence. This data (OAGK Keyword Generation Dataset) is released under the CC-BY licence (https://creativecommons.org/licenses/by/4.0/). If using it, please cite the following paper: Çano, Erion and Bojar, Ondřej, 2019, Keyphrase Generation: A Text Summarization Struggle, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics, June 2019, Minneapolis, USA
dc.language.iso	eng
dc.publisher	Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.relation	info:eu-repo/grantAgreement/EC/H2020/825460
dc.relation.isreferencedby	https://www.aclweb.org/anthology/N19-1070
dc.relation.isreplacedby	http://hdl.handle.net/11234/1-3062
dc.rights	Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/
dc.subject	keyword extraction
dc.subject	supervised keyword generation
dc.title	OAGK Keyword Generation Dataset
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
dc.rights.label	PUB
has.files	yes
branding	LINDAT / CLARIAH-CZ
contact.person	Erion Çano cano@ufal.mff.cuni.cz Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
sponsor	Ministerstvo školství, mládeže a tělovýchovy České republiky CZ.02.2.69/0.0/0.0/16_027/0008495 OP VVV Mezinárodní mobilita výzkumných pracovníků Univerzity Karlovy nationalFunds
sponsor	European Union H2020-ICT-2018-2-825460 ELITR - European Live Translator euFunds info:eu-repo/grantAgreement/EC/H2020/825460
size.info	2200000 entries
size.info	3 files
size.info	3.24 gb
files.size	1086288473
files.count	2

Files in this item

This item is publicly available and licensed under:
Creative Commons - Attribution 4.0 International (CC BY 4.0)

Name: README.txt
Size: 2.25 KB
Format: Text file
Description: readme
MD5: dc3560f8786a522c21ea96c4fc2f5c04

File Preview

OAGK Keyword Generation Dataset
===============================

OAGK is a keyword extraction/generation dataset consisting of 2.2 million
abstracts, titles and keyword strings from scientific articles. 
Texts were lowercased and tokenized with Stanford CoreNLP tokenizer. 
No other preprocessing steps were applied in this release version.
Dataset records (samples) are stored as JSON lines in each text file. 
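
Because each line of a data file holds one JSON record, the files can be streamed without loading them fully into memory. The following is a minimal sketch, not part of the original README; the helper names `iter_records` and `peek_fields` are hypothetical, and the exact JSON key names are not stated here, so the code only inspects whatever keys it finds.

```python
import json
from itertools import islice


def iter_records(lines):
    """Yield one dict per non-empty line, assuming one JSON object per line."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines defensively
            yield json.loads(line)


def peek_fields(path, limit=3):
    """Return the sorted key names of the first `limit` records in a file.

    `path` is assumed to point at an extracted file such as oagk_val.txt.
    """
    with open(path, encoding="utf-8") as handle:
        return [sorted(rec.keys()) for rec in islice(iter_records(handle), limit)]
```

Calling `peek_fields` on the smallest file (for example the validation split) is a cheap way to confirm the actual field names before processing the 2 GB training file.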

This data is derived from OAG data collection 
(https://aminer.org/open-academic-graph) which was released under 
ODC-BY licence. 

This data (OAGK Keyword Generation Dataset) is released under CC-BY licence 
(https://creativecommons.org/licenses/by/4.0/).


Download
--------

This dataset can be downloaded from the LINDAT/CLARIN repository:
http://hdl.handle.net/11234/1-2943


Publications
------------

If using it, please cite the following paper:

Çano, Erion and Bojar, Ondřej, 2019, Keyphrase Generation: A Text 
Summarization Struggle, 2019 Annual Conf . . .

Name: OAGK.zip
Size: 1.01 GB
Format: application/zip
Description: data
MD5: 92b0d028cde15184add0981349baccb4
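
The MD5 value listed above can be used to check that the archive downloaded intact. A minimal sketch follows, assuming the archive was saved locally as `OAGK.zip`; the function names `md5_of_file` and `verify_download` are illustrative and not part of the repository.

```python
import hashlib

# Checksum of OAGK.zip as listed on this page.
EXPECTED_MD5 = "92b0d028cde15184add0981349baccb4"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

Reading in chunks keeps memory use flat for the roughly 1 GB archive. MD5 is suitable here only as a transfer-integrity check, not as a security guarantee.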

File Preview

OAGK
- oagk_train.txt (2 GB)
- oagk_val.txt (141 MB)
- oagk_test.txt (239 MB)