OAGK Keyword Generation Dataset

Name: OAGK Keyword Generation Dataset
License: http://creativecommons.org/licenses/by/4.0/

LINDAT / CLARIAH-CZ

Authors: Çano, Erion

Item identifier: http://hdl.handle.net/11234/1-2943

Referenced by: https://www.aclweb.org/anthology/N19-1070

Date issued: 2019-04

Type: corpus, text

Size: 2,200,000 entries, 3 files, 3.24 GB

Language(s): English

Description: OAGK is a keyword extraction/generation dataset consisting of 2.2 million abstracts, titles and keyword strings from scientific articles. Texts were lowercased and tokenized with the Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file. This data is derived from the OAG data collection (https://aminer.org/open-academic-graph), which was released under the ODC-BY licence. This data (OAGK Keyword Generation Dataset) is released under the CC-BY licence (https://creativecommons.org/licenses/by/4.0/). If using it, please cite the following paper: Çano, Erion and Bojar, Ondřej, 2019, Keyphrase Generation: A Text Summarization Struggle, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics, June 2019, Minneapolis, USA

Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)

Subject(s): keyword extraction supervised keyword generation

Collection(s): LINDAT / CLARIAH-CZ Data & Tools

This item is replaced by a newer submission: http://hdl.handle.net/11234/1-3062

Files in this item

This item is publicly available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)

Name: README.txt
Size: 2.25 KB
Format: Text file
Description: readme
MD5: dc3560f8786a522c21ea96c4fc2f5c04

File Preview

OAGK Keyword Generation Dataset
===============================

OAGK is a keyword extraction/generation dataset consisting of 2.2 million
abstracts, titles and keyword strings from scientific articles. 
Texts were lowercased and tokenized with the Stanford CoreNLP tokenizer.
No other preprocessing steps were applied in this release version.
Dataset records (samples) are stored as JSON lines in each text file. 

This data is derived from the OAG data collection
(https://aminer.org/open-academic-graph), which was released under the
ODC-BY licence.

This data (OAGK Keyword Generation Dataset) is released under CC-BY licence 
(https://creativecommons.org/licenses/by/4.0/).


Download
--------

This dataset can be downloaded from the LINDAT/CLARIN repository:
http://hdl.handle.net/11234/1-2943


Publications
------------

If using it, please cite the following paper:

Çano, Erion and Bojar, Ondřej, 2019, Keyphrase Generation: A Text 
Summarization Struggle, 2019 Annual Conf . . .

Name: OAGK.zip
Size: 1.01 GB
Format: application/zip
Description: data
MD5: 92b0d028cde15184add0981349baccb4
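
The MD5 value listed above can be used to check that a downloaded copy of OAGK.zip is intact. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `matches_md5` are hypothetical, and the local path passed to them is an assumption about where the archive was saved.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a ~1 GB archive is never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path: str, expected_hex: str) -> bool:
    """Compare a file's MD5 digest with an expected hex string."""
    return md5_of_file(path) == expected_hex.strip().lower()


# Example call (assumes the archive was saved as "OAGK.zip" in the
# current directory; the expected value is the MD5 from the listing):
#   matches_md5("OAGK.zip", "92b0d028cde15184add0981349baccb4")
```

Reading in chunks is a design choice for large files; MD5 here serves only as a transfer-integrity check, not as a security guarantee.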

File Preview

OAGK
- oagk_train.txt (2 GB)
- oagk_val.txt (141 MB)
- oagk_test.txt (239 MB)
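
The description states that records are stored as JSON lines in each text file, so each file can be read one line at a time without loading it whole. The sketch below assumes every non-empty line is a single JSON value; the function name `read_json_lines` is hypothetical, and the field names inside each record are not given on this page, so the code does not assume any.

```python
import json
from typing import Any, Iterator


def read_json_lines(path: str) -> Iterator[Any]:
    """Yield one parsed record per non-empty line of a JSON lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, if any
                yield json.loads(line)


# Example use (assumes OAGK.zip was extracted into the current directory):
#   for record in read_json_lines("oagk_val.txt"):
#       print(record)   # inspect the first record to discover its fields
#       break
```

Because the function is a generator, even the 2 GB training file can be streamed record by record.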