OAGK Keyword Generation Dataset
OAGK is a keyword extraction/generation dataset consisting of 2.2 million
abstracts, titles and keyword strings from scientific articles.
Texts were lowercased and tokenized with the Stanford CoreNLP tokenizer.
No other preprocessing steps were applied in this release version.
Dataset records (samples) are stored as JSON lines in each text file.
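
Since each line of a data file holds one JSON record, the files can be read
one line at a time without loading everything into memory. The following is
a minimal Python sketch of that idea, not an official loader. The function
name read_records and the file name in the usage comment are hypothetical,
and the field names inside each record are not assumed here; inspect the
keys of the first record to see what a file actually contains.

```python
import json


def read_records(path):
    """Yield one dict per non-empty line of a JSON Lines text file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, if any
                yield json.loads(line)


# Usage (the file name is a placeholder, not taken from the release):
# for record in read_records("oagk_part.txt"):
#     print(sorted(record.keys()))  # inspect which fields are present
#     break
```

Reading lazily with a generator keeps memory use low, which matters for a
collection of this size.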
This data is derived from the OAG data collection
(https://aminer.org/open-academic-graph) which was released under
This data (OAGK Keyword Generation Dataset) is released under the CC-BY licence.
This dataset can be downloaded from the LINDAT/CLARIN repository.
If using it, please cite the following paper:
Çano, Erion and Bojar, Ondřej, 2019, Keyphrase Generation: A Text
Summarization Struggle, 2019 Annual Conf . . .