OAGK Keyword Generation Dataset

Please use the following text to cite this item:
Çano, Erion, 2019, OAGK Keyword Generation Dataset, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11234/1-2943.
Date issued
2019-04
Size
2200000 entries, 3 files, 3.24 GB
Description
OAGK is a keyword extraction/generation dataset consisting of 2.2 million abstracts, titles and keyword strings from scientific articles. Texts were lowercased and tokenized with the Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file. This data is derived from the OAG data collection (https://aminer.org/open-academic-graph), which was released under the ODC-BY licence. This data (OAGK Keyword Generation Dataset) is released under the CC-BY licence (https://creativecommons.org/licenses/by/4.0/). If using it, please cite the following paper: Çano, Erion and Bojar, Ondřej, 2019, Keyphrase Generation: A Text Summarization Struggle, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics, June 2019, Minneapolis, USA.
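Because the records are stored as JSON lines, each line of a data file can be parsed on its own. The following is a minimal sketch of how one might iterate over records and discover which fields they carry. This page does not document the field names, so the code collects them from the data instead of assuming them; the function names `iter_records` and `field_names` and the file name in the usage comment are illustrative, not part of the dataset.

```python
import json
from itertools import islice
from typing import Iterable, Iterator, List


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines between records
            yield json.loads(line)


def field_names(records: Iterable[dict], limit: int = 100) -> List[str]:
    """Return the sorted union of keys seen in the first `limit` records."""
    names = set()
    for record in islice(records, limit):
        names.update(record)
    return sorted(names)


# Hypothetical usage, assuming oagk_val.txt has been extracted locally
# and is UTF-8 encoded (an assumption, not stated on this page):
#
#     with open("oagk_val.txt", encoding="utf-8") as handle:
#         print(field_names(iter_records(handle)))
```

Reading line by line keeps memory use flat, which matters for the 2 GB training file.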
Version History

Version / Date / Summary
2019-10-21 00:00:00
1*
2019-04-01 00:00:00
* Selected version
This item is Publicly Available
and licensed under: CC-BY (https://creativecommons.org/licenses/by/4.0/)
 Files in this item
Name
README.txt
Size
2.25 KB
Format
text/plain
Description
Text
MD5
dc3560f8786a522c21ea96c4fc2f5c04
Preview
  File Preview
    OAGK Keyword Generation Dataset
    ===============================
    
    OAGK is a keyword extraction/generation dataset consisting of 2.2 million
    abstracts, titles and keyword strings from scientific articles. 
    Texts were lowercased and tokenized with the Stanford CoreNLP tokenizer.
    No other preprocessing steps were applied in this release version.
    Dataset records (samples) are stored as JSON lines in each text file. 
    
    This data is derived from the OAG data collection
    (https://aminer.org/open-academic-graph), which was released under the
    ODC-BY licence.
    
    This data (OAGK Keyword Generation Dataset) is released under the CC-BY
    licence (https://creativecommons.org/licenses/by/4.0/).
    
    
    Download
    --------
    
    This dataset can be downloaded from the LINDAT/CLARIN repository:
    http://hdl.handle.net/11234/1-2943
    
    
    Publications
    ------------
    
    If using it, please cite the following paper:
    
    Çano, Erion and Bojar, Ondřej, 2019, Keyphrase Generation: A Text 
    Summarization Struggle, 2019 Annual Conference of the North American 
    Chapter of the Association for Computational Linguistics, June 2019, 
    Minneapolis, USA 
    
    
    Acknowledgements
    ----------------
    
    This research work was [partially] supported by OP RDE project No. 
    CZ.02.2.69/0.0/0.0/16_027/0008495, International Mobility of 
    Researchers at Charles University.
    
    
    Data Statistics
    --------------- 
    
    OAGK (fullset) statistics: 
    Records: 2200000
    Total keyphrases: 13435189
    Total title tokens: 26998352
    Total abstract tokens: 499179395
    Average keyphrases: 6.106903
    Average title tokens: 12.271974
    Average abstract tokens: 226.899725
    
    oagk_train statistics:
    Records: 2000000
    Total keyphrases: 11990067
    Total title tokens: 24127290
    Total abstract tokens: 440850430
    Average keyphrases: 5.9950335
    Average title tokens: 12.063645
    Average abstract tokens: 220.425215
    
    oagk_val statistics:
    Records: 100000
    Total keyphrases: 575022
    Total title tokens: 1284088
    Total abstract tokens: 21106435
    Average keyphrases: 5.75022
    Average title tokens: 12.84088
    Average abstract tokens: 21 . . .
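The figures in these statistics blocks are counts and per-record averages (each average is the corresponding total divided by the number of records, e.g. 11990067 / 2000000 = 5.9950335 for oagk_train). The following is a rough sketch of how such figures could be recomputed from parsed records. It rests on assumptions that the README does not confirm: the field names `title`, `abstract` and `keyword`, a comma as the keyphrase separator, and whitespace splitting as the token count (plausible because the texts are already tokenized). All of these are exposed as parameters so they can be changed.

```python
from typing import Dict, Iterable


def summarize(
    records: Iterable[dict],
    title_key: str = "title",        # assumed field name
    abstract_key: str = "abstract",  # assumed field name
    keyword_key: str = "keyword",    # assumed field name
    sep: str = ",",                  # assumed keyphrase separator
) -> Dict[str, float]:
    """Count records, keyphrases and whitespace tokens, then derive averages."""
    n = keyphrases = title_tokens = abstract_tokens = 0
    for record in records:
        n += 1
        # A keyphrase is any non-empty piece of the keyword string.
        pieces = record.get(keyword_key, "").split(sep)
        keyphrases += sum(1 for piece in pieces if piece.strip())
        # Texts are pre-tokenized, so splitting on whitespace counts tokens.
        title_tokens += len(record.get(title_key, "").split())
        abstract_tokens += len(record.get(abstract_key, "").split())

    stats: Dict[str, float] = {
        "Records": n,
        "Total keyphrases": keyphrases,
        "Total title tokens": title_tokens,
        "Total abstract tokens": abstract_tokens,
    }
    if n:  # avoid dividing by zero on an empty input
        stats["Average keyphrases"] = keyphrases / n
        stats["Average title tokens"] = title_tokens / n
        stats["Average abstract tokens"] = abstract_tokens / n
    return stats
```

If the assumed separator or field names differ from the actual files, the recomputed totals will not match the published ones, which makes this a quick check of those assumptions.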
Name
OAGK.zip
Size
1.01 GB
Format
application/zip
Description
Zip
MD5
92b0d028cde15184add0981349baccb4
Preview
  File Preview
  • OAGK
    • oagk_train.txt (2 GB)
    • oagk_val.txt (141 MB)
    • oagk_test.txt (239 MB)
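With the MD5 value and archive layout listed above, a downloaded copy can be verified and then read without unpacking it. The following is a minimal sketch under stated assumptions: the member path `OAGK/oagk_val.txt` is inferred from the file preview, UTF-8 encoding is assumed, and the helper names (`file_chunks`, `md5_hex`, `iter_zip_records`) are illustrative.

```python
import hashlib
import io
import json
import zipfile
from typing import Iterable, Iterator

# MD5 listed for OAGK.zip on this page.
EXPECTED_MD5 = "92b0d028cde15184add0981349baccb4"


def file_chunks(path: str, size: int = 1 << 20) -> Iterator[bytes]:
    """Yield a file's bytes in 1 MiB chunks so the archive is never fully in memory."""
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(size), b""):
            yield chunk


def md5_hex(chunks: Iterable[bytes]) -> str:
    """Return the hexadecimal MD5 digest of a sequence of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def iter_zip_records(archive_file, member: str = "OAGK/oagk_val.txt") -> Iterator[dict]:
    """Stream JSON-lines records from one archive member without extracting it.

    `archive_file` may be a path or a binary file-like object. The default
    member path is an assumption based on the preview tree.
    """
    with zipfile.ZipFile(archive_file) as archive:
        with archive.open(member) as raw:
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                line = line.strip()
                if line:
                    yield json.loads(line)


# Hypothetical usage, assuming OAGK.zip is in the working directory:
#
#     if md5_hex(file_chunks("OAGK.zip")) == EXPECTED_MD5:
#         first = next(iter_zip_records("OAGK.zip"))
#         print(sorted(first))
```

Calling `zipfile.ZipFile("OAGK.zip").namelist()` first is a cheap way to confirm the actual member paths before relying on the default.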