OAGKX Keyword Generation Dataset
Please use the following text to cite this item:
Çano, Erion, 2019, OAGKX Keyword Generation Dataset, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11234/1-3062.
Date issued
2019-10-21
Size
22674436 entries, 37 files, 27.4 GB, 8.5 GB
Description
OAGKX is a keyword extraction/generation dataset consisting of 22674436 abstracts, titles and keyword strings from scientific articles. The texts were lowercased and tokenized with the Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file.
The data is derived from the OAG data collection (https://aminer.org/open-academic-graph), which was released under the ODC-BY license.
This data (OAGKX Keyword Generation Dataset) is released under the CC-BY license (https://creativecommons.org/licenses/by/4.0/).
If using it, please cite the following paper:
Çano Erion, Bojar Ondřej. Keyphrase Generation: A Multi-Aspect Survey. FRUCT 2019, Proceedings of the 25th Conference of the Open Innovations Association FRUCT, Helsinki, Finland, Nov. 2019
To reproduce the experiments in the above paper, you can use the first 100000 lines of the part_0_0.txt file.
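Because the records are stored as JSON lines, the subset mentioned above can be read line by line. The following is a minimal sketch, not the authors' code: the function name read_first_records is hypothetical, and no field names are assumed because they are not listed on this page.

```python
import json
from itertools import islice


def read_first_records(path, limit=100000):
    """Yield the records found in the first `limit` lines of a JSON-lines file."""
    with open(path, encoding="utf-8") as handle:
        # islice stops after `limit` lines without reading the rest of the file.
        for line in islice(handle, limit):
            line = line.strip()
            if line:  # skip blank lines, if any
                yield json.loads(line)
```

Assuming the archive was extracted into a directory named oagkx, calling list(read_first_records("oagkx/part_0_0.txt")) would load the subset into memory, and inspecting the keys of the first record shows which fields are available.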
Acknowledgement
Ministry of Education, Youth and Sports of the Czech Republic (Ministerstvo školství, mládeže a tělovýchovy České republiky)
Project code: CZ.02.2.69/0.0/0.0/16_027/0008495
Project name: OP RDE (OP VVV), International Mobility of Researchers at Charles University
European Union
Project code: H2020-ICT-2018-2-825460
Project name: ELITR - European Live Translator
This item is publicly available and licensed under CC-BY (https://creativecommons.org/licenses/by/4.0/).
Files in this item
Name: README.txt
Size: 1.93 KB
Format: text/plain
Description: readme
MD5: a286e714b793d3a196864122183a7fa1

OAGKX Keyword Generation Dataset
================================

OAGKX is a keyword extraction/generation dataset consisting of 22674436 abstracts, titles and keyword strings from scientific articles. The texts were lowercased and tokenized with the Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file.

The data is derived from the OAG data collection (https://aminer.org/open-academic-graph), which was released under the ODC-BY license.

This data (OAGKX Keyword Generation Dataset) is released under the CC-BY license (https://creativecommons.org/licenses/by/4.0/).

Download
--------

This dataset can be downloaded from the LINDAT/CLARIN repository: http://hdl.handle.net/11234/1-3062

Publications
------------

If using it, please cite the following paper:

Çano Erion, Bojar Ondřej. Keyphrase Generation: A Multi-Aspect Survey. FRUCT 2019, Proceedings of the 25th Conference of the Open Innovations Association FRUCT, Helsinki, Finland, Nov. 2019

To reproduce the experiments in the above paper, you can use the first 100000 lines of the part_0_0.txt file.

Acknowledgements
----------------

This research work was [partially] supported by OP RDE project No. CZ.02.2.69/0.0/0.0/16_027/0008495, International Mobility of Researchers at Charles University.

Statistics of OAGKX:
--------------------

Total samples: 22674436
Title tokens: mean: 12.83  std: 4.86  min: 3  max: 25  total: 290841390
Abstract tokens: mean: 175.08  std: 86.45  min: 50  max: 400  total: 3969764238
Keyword tokens: mean: 11.89  std: 7.46  min: 2  max: 60  total: 269504044
No. Keywords: mean: 5.88  std: 3.12  min: 2  max: 12  total: 133295056
Abs-Tit overlap: mean: 0.7787  std: 0.1738
Abs-Key overlap: mean: 0.6769  std: 0.2462
Present Keywords: mean: 0.5265  std: 0.2832
Absent Keywords: mean: 0.4735  std: 0.2832
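The reported statistics can be cross-checked with simple arithmetic: each mean should equal the corresponding total divided by the number of samples, and the present and absent keyword fractions should sum to 1. The sketch below only reuses the figures listed above; the names REPORTED and recomputed_means are illustrative, not part of the dataset.

```python
TOTAL_SAMPLES = 22674436

# (reported mean, reported total), copied from the statistics above.
REPORTED = {
    "title tokens": (12.83, 290841390),
    "abstract tokens": (175.08, 3969764238),
    "keyword tokens": (11.89, 269504044),
    "keywords per record": (5.88, 133295056),
}


def recomputed_means(reported=REPORTED, samples=TOTAL_SAMPLES):
    """Recompute each mean as total / samples, rounded to two decimals."""
    return {name: round(total / samples, 2) for name, (_, total) in reported.items()}


for name, mean in recomputed_means().items():
    print(f"{name}: reported {REPORTED[name][0]}, recomputed {mean}")

# Present and absent keyword fractions are complementary.
print("present + absent:", round(0.5265 + 0.4735, 4))
```

A mismatch after recomputing the same quantities on a local copy would suggest an incomplete download or a different record count.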
Name: oagkx.zip
Size: 8.51 GB
Format: application/zip
Description: data
MD5: 8a6475ea0d5a38c7aff97a0f5260df20
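The MD5 value listed for oagkx.zip can be used to verify a download before extracting it. The following is a minimal sketch using Python's standard hashlib module; the function names and the local file path are assumptions for illustration, and the file is read in chunks so that the 8.51 GB archive does not have to fit in memory.

```python
import hashlib

# MD5 value listed above for oagkx.zip.
EXPECTED_MD5 = "8a6475ea0d5a38c7aff97a0f5260df20"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Check whether the file at `path` has the expected MD5 digest."""
    return md5_of_file(path) == expected.lower()
```

For example, matches_expected("oagkx.zip") should return True for an intact copy saved under that (assumed) name in the current directory. The same helper works for README.txt with the digest listed for that file.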

- oagkx
  - part_11_0.txt (11 MB)
  - part_3_1.txt (1 GB)
  - part_0_1.txt (900 MB)
  - part_13_0.txt (69 MB)
  - part_10_0.txt (873 MB)
  - part_2_1.txt (1 GB)
  - part_12_0.txt (11 MB)
  - part_5_1.txt (877 MB)
  - part_1_1.txt (867 MB)
  - part_14_0.txt (1 GB)
  - part_7_1.txt (120 MB)
  - part_4_1.txt (1 GB)
  - part_9_1.txt (867 MB)
  - part_6_1.txt (1 GB)
  - part_8_1.txt (541 MB)
  - part_0_0.txt (752 MB)
  - part_3_0.txt (1 GB)
  - part_5_0.txt (1 GB)
  - part_2_0.txt (1 GB)
  - part_7_0.txt (1 GB)
  - part_4_0.txt (1 GB)
  - part_1_0.txt (1 GB)
  - part_6_0.txt (789 MB)
  - part_9_0.txt (709 MB)
  - part_8_0.txt (561 MB)
  - part_11_1.txt (9 MB)
  - part_13_1.txt (108 MB)
  - part_10_1.txt (58 MB)
  - part_3_2.txt (437 MB)
  - part_0_2.txt (770 MB)
  - part_12_1.txt (9 MB)
  - part_5_2.txt (880 MB)
  - part_2_2.txt (345 MB)
  - part_4_2.txt (568 MB)
  - part_1_2.txt (759 MB)
  - part_14_1.txt (1 GB)
  - part_7_2.txt (311 MB)

