
OAGKX Keyword Generation Dataset

Please use the following text to cite this item:
Çano, Erion, 2019, OAGKX Keyword Generation Dataset, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11234/1-3062.
Date issued
2019-10-21
Size
22674436 entries,
37 files,
27.4 GB,
8.5 GB
Language(s)
Description
OAGKX is a keyword extraction/generation dataset consisting of 22674436 abstracts, titles and keyword strings from scientific articles. The texts were lowercased and tokenized with the Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file. The data is derived from the OAG data collection (https://aminer.org/open-academic-graph), which was released under the ODC-BY license. This data (OAGKX Keyword Generation Dataset) is released under the CC-BY license (https://creativecommons.org/licenses/by/4.0/). If using it, please cite the following paper: Çano Erion, Bojar Ondřej. Keyphrase Generation: A Multi-Aspect Survey. FRUCT 2019, Proceedings of the 25th Conference of the Open Innovations Association FRUCT, Helsinki, Finland, Nov. 2019. To reproduce the experiments in the above paper, you can use the first 100000 lines of the part_0_0.txt file.
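Since the records are stored as JSON lines, the subset mentioned above can be loaded with a few lines of Python. The following is a minimal sketch, not the authors' own loading code: the function name `read_first_records` is hypothetical, and UTF-8 encoding is an assumption. It makes no assumption about the field names inside each record.

```python
import json
from itertools import islice


def read_first_records(path, limit=100000):
    """Parse the JSON records found in the first `limit` lines of a file.

    Hypothetical helper: assumes one JSON object per line and UTF-8 text.
    """
    records = []
    with open(path, encoding="utf-8") as handle:
        # islice stops after `limit` lines, so the whole file is never read
        for line in islice(handle, limit):
            line = line.strip()
            if line:  # ignore blank lines instead of failing on them
                records.append(json.loads(line))
    return records
```

Calling `read_first_records("oagkx/part_0_0.txt")` on the extracted archive should return a list of parsed records. Inspecting the first element (for example with `records[0].keys()` if the records are JSON objects) shows which fields are actually present.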
Acknowledgement
This item is Publicly Available
and licensed under: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
 Files in this item
Name
README.txt
Size
1.93 KB
Format
text/plain
Description
readme
MD5
a286e714b793d3a196864122183a7fa1
Preview
  File Preview
    OAGKX Keyword Generation Dataset
    ================================
    
    OAGKX is a keyword extraction/generation dataset consisting
    of 22674436 abstracts, titles and keyword strings from scientific
    articles. The texts were lowercased and tokenized with the
    Stanford CoreNLP tokenizer. No other preprocessing steps
    were applied in this release version. Dataset records
    (samples) are stored as JSON lines in each text file.
    
    The data is derived from the OAG data collection
    (https://aminer.org/open-academic-graph), which was released
    under the ODC-BY license.
    
    This data (OAGKX Keyword Generation Dataset) is released under
    the CC-BY license (https://creativecommons.org/licenses/by/4.0/).
    
    
    Download
    --------
    
    This dataset can be downloaded from the LINDAT/CLARIN repository:
    http://hdl.handle.net/11234/1-3062
    
    
    Publications
    ------------
    
    If using it, please cite the following paper:
    
    Çano Erion, Bojar Ondřej. Keyphrase Generation: A Multi-Aspect Survey. FRUCT 2019,
    Proceedings of the 25th Conference of the Open Innovations Association FRUCT, Helsinki,
    Finland, Nov. 2019
    
    To reproduce the experiments in the above paper, you can use
    the first 100000 lines of the part_0_0.txt file.
    
    
    Acknowledgements
    ----------------
    
    This research work was [partially] supported by OP RDE project No. 
    CZ.02.2.69/0.0/0.0/16_027/0008495, International Mobility of 
    Researchers at Charles University.
    
    
    Statistics of OAGKX:
    --------------------
    
    Total samples:     22674436
    Title tokens:      mean: 12.83   std: 4.86   min: 3   max: 25   total: 290841390
    Abstract tokens:   mean: 175.08  std: 86.45  min: 50  max: 400  total: 3969764238
    Keyword tokens:    mean: 11.89   std: 7.46   min: 2   max: 60   total: 269504044
    No. Keywords:      mean: 5.88    std: 3.12   min: 2   max: 12   total: 133295056
    Abs-Tit overlap:   mean: 0.7787  std: 0.1738
    Abs-Key overlap:   mean: 0.6769  std: 0.2462
    Present Keywords:  mean: 0.5265  std: 0.2832
    Absent Keywords:   mean: 0.4735  std: 0.2832
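
The reported means can be cross-checked against the reported totals. The sketch below assumes that each "total" is the sum over all 22674436 samples, so that total divided by sample count should reproduce the mean to two decimals. The dictionary keys and function names are hypothetical labels chosen for this illustration.

```python
TOTAL_SAMPLES = 22674436

# (reported mean, reported total), copied from the statistics above.
# The key names are illustrative labels, not field names from the dataset.
REPORTED = {
    "title_tokens": (12.83, 290841390),
    "abstract_tokens": (175.08, 3969764238),
    "keyword_tokens": (11.89, 269504044),
    "num_keywords": (5.88, 133295056),
}


def mean_from_total(total, samples=TOTAL_SAMPLES):
    """Recompute a per-sample mean from a total, rounded to two decimals."""
    return round(total / samples, 2)


def check_reported_means():
    """Map each statistic to True if total / samples matches its reported mean."""
    return {
        name: mean_from_total(total) == mean
        for name, (mean, total) in REPORTED.items()
    }
```

Under that assumption, `check_reported_means()` flags any statistic whose mean and total disagree. The same kind of check applies to the present and absent keyword ratios, whose reported means (0.5265 and 0.4735) sum to 1.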
    
    
Name
oagkx.zip
Size
8.51 GB
Format
application/zip
Description
data
MD5
8a6475ea0d5a38c7aff97a0f5260df20
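
The MD5 value listed above can be used to check that the archive downloaded intact. The following is a minimal sketch using Python's standard `hashlib`; the function names `md5_of_file` and `matches_listed_md5` are hypothetical, and the archive is assumed to be saved locally as `oagkx.zip`.

```python
import hashlib

# MD5 listed on this page for oagkx.zip
LISTED_MD5 = "8a6475ea0d5a38c7aff97a0f5260df20"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b""
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, expected=LISTED_MD5):
    """Compare a file's MD5 digest with the expected hex string."""
    return md5_of_file(path) == expected.lower()
```

Reading in chunks keeps memory use small, which matters for an 8.51 GB archive. A call such as `matches_listed_md5("oagkx.zip")` returning `False` suggests an incomplete or corrupted download.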
Preview
  File Preview
  • oagkx
    • part_11_0.txt (11 MB)
    • part_3_1.txt (1 GB)
    • part_0_1.txt (900 MB)
    • part_13_0.txt (69 MB)
    • part_10_0.txt (873 MB)
    • part_2_1.txt (1 GB)
    • part_12_0.txt (11 MB)
    • part_5_1.txt (877 MB)
    • part_1_1.txt (867 MB)
    • part_14_0.txt (1 GB)
    • part_7_1.txt (120 MB)
    • part_4_1.txt (1 GB)
    • part_9_1.txt (867 MB)
    • part_6_1.txt (1 GB)
    • part_8_1.txt (541 MB)
    • part_0_0.txt (752 MB)
    • part_3_0.txt (1 GB)
    • part_5_0.txt (1 GB)
    • part_2_0.txt (1 GB)
    • part_7_0.txt (1 GB)
    • part_4_0.txt (1 GB)
    • part_1_0.txt (1 GB)
    • part_6_0.txt (789 MB)
    • part_9_0.txt (709 MB)
    • part_8_0.txt (561 MB)
    • part_11_1.txt (9 MB)
    • part_13_1.txt (108 MB)
    • part_10_1.txt (58 MB)
    • part_3_2.txt (437 MB)
    • part_0_2.txt (770 MB)
    • part_12_1.txt (9 MB)
    • part_5_2.txt (880 MB)
    • part_2_2.txt (345 MB)
    • part_4_2.txt (568 MB)
    • part_1_2.txt (759 MB)
    • part_14_1.txt (1 GB)
    • part_7_2.txt (311 MB)