This is not the latest version of this item. The latest version can be found here.
OAGK Keyword Generation Dataset
Please use the following text to cite this item:
Çano, Erion, 2019,
OAGK Keyword Generation Dataset, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11234/1-2943.
Date issued
2019-04
Size
2200000 entries,
3 files,
3.24 GB
Description
OAGK is a keyword extraction/generation dataset consisting of 2.2 million abstracts, titles and keyword strings from scientific articles. Texts were lowercased and tokenized with the Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file.
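Since each line of the text files holds one JSON record, the files can be read line by line without loading everything into memory. The following is a minimal sketch, not part of the original release: the helper names `read_jsonl` and `peek` are hypothetical, and the field names inside each record are not stated on this page, so the example only parses records and leaves their keys for inspection.

```python
import json
from typing import Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one record (a dict) per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, if any
                yield json.loads(line)


def peek(path: str, n: int = 3) -> list:
    """Return the first n records without reading the rest of the file."""
    records = []
    for record in read_jsonl(path):
        if len(records) >= n:
            break
        records.append(record)
    return records


# Hypothetical usage, assuming oagk_val.txt has been extracted locally:
#   for record in peek("oagk_val.txt"):
#       print(sorted(record.keys()))
```

Streaming with a generator is a reasonable choice here because the training file alone is listed at about 2 GB.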
This data is derived from the OAG data collection (https://aminer.org/open-academic-graph), which was released under the ODC-BY licence.
This data (OAGK Keyword Generation Dataset) is released under the CC-BY licence (https://creativecommons.org/licenses/by/4.0/).
If using it, please cite the following paper:
Çano, Erion and Bojar, Ondřej, 2019, Keyphrase Generation: A Text Summarization Struggle, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics, June 2019, Minneapolis, USA
Acknowledgement
Ministerstvo školství, mládeže a tělovýchovy České republiky
Project code: CZ.02.2.69/0.0/0.0/16_027/0008495
Project name: OP VVV Mezinárodní mobilita výzkumných pracovníků Univerzity Karlovy
European Union
Project code: H2020-ICT-2018-2-825460
Project name: ELITR - European Live Translator
This item is publicly available and licensed under CC-BY (https://creativecommons.org/licenses/by/4.0/).
Files in this item
- Name: README.txt
- Size: 2.25 KB
- Format: text/plain
- Description: Text
- MD5: dc3560f8786a522c21ea96c4fc2f5c04

OAGK Keyword Generation Dataset
===============================

OAGK is a keyword extraction/generation dataset consisting of 2.2 million abstracts, titles and keyword strings from scientific articles. Texts were lowercased and tokenized with the Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file.

This data is derived from the OAG data collection (https://aminer.org/open-academic-graph), which was released under the ODC-BY licence.

This data (OAGK Keyword Generation Dataset) is released under the CC-BY licence (https://creativecommons.org/licenses/by/4.0/).

Download
--------

This dataset can be downloaded from the LINDAT/CLARIN repository:
http://hdl.handle.net/11234/1-2943

Publications
------------

If using it, please cite the following paper:

Çano, Erion and Bojar, Ondřej, 2019, Keyphrase Generation: A Text Summarization Struggle, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics, June 2019, Minneapolis, USA

Acknowledgements
----------------

This research work was [partially] supported by OP RDE project No. CZ.02.2.69/0.0/0.0/16_027/0008495, International Mobility of Researchers at Charles University.

Data Statistics
---------------

OAGK (fullset) statistics:
Records: 2200000
Total keyphrases: 13435189
Total title tokens: 26998352
Total abstract tokens: 499179395
Average keyphrases: 6.106903
Average title tokens: 12.271974
Average abstract tokens: 226.899725

oagk_train statistics:
Records: 2000000
Total keyphrases: 11990067
Total title tokens: 24127290
Total abstract tokens: 440850430
Average keyphrases: 5.9950335
Average title tokens: 12.063645
Average abstract tokens: 220.425215

oagk_val statistics:
Records: 100000
Total keyphrases: 575022
Total title tokens: 1284088
Total abstract tokens: 21106435
Average keyphrases: 5.75022
Average title tokens: 12.84088
Average abstract tokens: 21 . . .
- Name: OAGK.zip
- Size: 1.01 GB
- Format: application/zip
- Description: Zip
- MD5: 92b0d028cde15184add0981349baccb4

- OAGK
  - oagk_train.txt (2 GB)
  - oagk_val.txt (141 MB)
  - oagk_test.txt (239 MB)
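The MD5 values listed for README.txt and OAGK.zip can be used to check that a download is intact. The following is a minimal sketch using Python's standard `hashlib`; the function names `md5_of_file` and `matches_listed_md5` and the local file path in the usage comment are assumptions for illustration, not part of the repository.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path: str, expected: str) -> bool:
    """Compare a file's MD5 with a listed value, ignoring case and whitespace."""
    return md5_of_file(path) == expected.strip().lower()


# Hypothetical usage, assuming OAGK.zip was saved in the current directory:
#   matches_listed_md5("OAGK.zip", "92b0d028cde15184add0981349baccb4")
```

Reading in chunks keeps memory use small for the 1.01 GB archive.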

