OAGK Keyword Generation Dataset
===============================

OAGK is a keyword extraction/generation dataset consisting of 2.2 million abstracts, titles and keyword strings from scientific articles. Texts were lowercased and tokenized with the Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file, one record per line.

This data is derived from the OAG data collection (https://aminer.org/open-academic-graph), which was released under the ODC-BY licence.

This data (OAGK Keyword Generation Dataset) is released under the CC-BY licence (https://creativecommons.org/licenses/by/4.0/).

Download
--------

This dataset can be downloaded from the LINDAT/CLARIN repository:
http://hdl.handle.net/11234/1-2943

Publications
------------

If you use this dataset, please cite the following paper:

Çano, Erion and Bojar, Ondřej, 2019, Keyphrase Generation: A Text Summarization Struggle, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics, June 2019, Minneapolis, USA

Acknowledgements
----------------

This research work was [partially] supported by OP RDE project No. CZ.02.2.69/0.0/0.0/16_027/0008495, International Mobility of Researchers at Charles University.
Data Statistics
---------------

OAGK (fullset) statistics:
    Records: 2200000
    Total keyphrases: 13435189
    Total title tokens: 26998352
    Total abstract tokens: 499179395
    Average keyphrases: 6.106903
    Average title tokens: 12.271974
    Average abstract tokens: 226.899725

oagk_train statistics:
    Records: 2000000
    Total keyphrases: 11990067
    Total title tokens: 24127290
    Total abstract tokens: 440850430
    Average keyphrases: 5.9950335
    Average title tokens: 12.063645
    Average abstract tokens: 220.425215

oagk_val statistics:
    Records: 100000
    Total keyphrases: 575022
    Total title tokens: 1284088
    Total abstract tokens: 21106435
    Average keyphrases: 5.75022
    Average title tokens: 12.84088
    Average abstract tokens: 211.06435

oagk_test statistics:
    Records: 100000
    Total keyphrases: 870100
    Total title tokens: 1586974
    Total abstract tokens: 37222530
    Average keyphrases: 8.701
    Average title tokens: 15.86974
    Average abstract tokens: 372.2253
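
Example: recomputing the statistics
-----------------------------------

The figures above could be recomputed from the JSON lines files with a short script. The following is only a minimal sketch, not the script used to produce this release. It assumes that each record has the fields "title", "abstract" and "keywords", that the keyword string separates keyphrases with a comma, and that tokens are separated by whitespace (plausible given the tokenized text, but not confirmed here). These field names and the separator are hypothetical; inspect a few lines of the actual files and adjust the parameters.

```python
import json


def read_records(lines):
    """Yield one dict per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def compute_stats(records, title_key="title", abstract_key="abstract",
                  keywords_key="keywords", sep=","):
    """Count records, keyphrases and whitespace-separated tokens.

    The key names and the keyphrase separator are assumptions,
    not documented properties of the OAGK files.
    """
    n = keyphrases = title_tokens = abstract_tokens = 0
    for rec in records:
        n += 1
        # Ignore empty pieces produced by stray separators.
        keyphrases += sum(1 for k in rec[keywords_key].split(sep) if k.strip())
        title_tokens += len(rec[title_key].split())
        abstract_tokens += len(rec[abstract_key].split())
    denom = n or 1  # avoid division by zero on an empty file
    return {
        "records": n,
        "total_keyphrases": keyphrases,
        "total_title_tokens": title_tokens,
        "total_abstract_tokens": abstract_tokens,
        "avg_keyphrases": keyphrases / denom,
        "avg_title_tokens": title_tokens / denom,
        "avg_abstract_tokens": abstract_tokens / denom,
    }
```

To run it on a file, pass an open file handle to read_records, for example: `with open(path, encoding="utf-8") as f: print(compute_stats(read_records(f)))`, where `path` points to one of the downloaded text files. Because records are streamed one line at a time, the full 2.2 million records never need to be held in memory.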