Files in this item

License category:
Publicly Available

Licence: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name: README.txt
Size: 2.25 KB
Format: Text file
Description: readme
MD5: dc3560f8786a522c21ea96c4fc2f5c04

File preview:
OAGK Keyword Generation Dataset
===============================

OAGK is a keyword extraction/generation dataset consisting of 2.2 million
abstracts, titles and keyword strings from scientific articles.
Texts were lowercased and tokenized with the Stanford CoreNLP tokenizer.
No other preprocessing steps were applied in this release version.
Dataset records (samples) are stored as JSON lines, one record per line,
in each text file.
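
Since the records are stored as JSON lines, a file can be streamed one record at a time instead of being loaded whole. The following is a minimal Python sketch under that assumption; the helper name `read_jsonl` is illustrative and not part of the dataset. The README does not list the field names, so the sketch does not assume any.

```python
import json
from typing import Any, Dict, Iterator


def read_jsonl(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record per non-empty line of a JSON lines file.

    Assumption: each line holds a single JSON object, as the README states.
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline at end of file
                yield json.loads(line)
```

For example, `sorted(next(read_jsonl("oagk_val.txt")))` would show the keys of the first validation record, which is a safe way to discover the actual field names before writing further processing code.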

This data is derived from the OAG data collection
(https://aminer.org/open-academic-graph), which was released under
the ODC-BY licence.

This data (OAGK Keyword Generation Dataset) is released under the CC-BY
licence (https://creativecommons.org/licenses/by/4.0/).


Download
--------

This dataset can be downloaded from the LINDAT/CLARIN repository:
http://hdl.handle.net/11234/1-2943


Publications
------------

If you use this dataset, please cite the following paper:

Çano, Erion and Bojar, Ondřej, 2019, Keyphrase Generation: A Text 
Summarization Struggle, 2019 Annual Conf . . .
                                            
Name: OAGK.zip
Size: 1.01 GB
Format: application/zip
Description: data
MD5: 92b0d028cde15184add0981349baccb4

File preview:
  • OAGK
    • oagk_train.txt (2 GB)
    • oagk_val.txt (141 MB)
    • oagk_test.txt (239 MB)
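
The MD5 value listed for OAGK.zip can be used to check that the 1.01 GB download arrived intact. The following Python sketch is one way to do that; the helper name `md5_of_file` and the local file name `OAGK.zip` are assumptions, and the file is read in chunks so the archive does not have to fit in memory.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Checksum listed for OAGK.zip on this page.
EXPECTED_MD5 = "92b0d028cde15184add0981349baccb4"
```

Assuming the archive was saved as `OAGK.zip` in the current directory, `md5_of_file("OAGK.zip") == EXPECTED_MD5` should evaluate to `True` for an uncorrupted download.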