
 
dc.contributor.author Çano, Erion
dc.date.accessioned 2019-03-08T12:46:46Z
dc.date.available 2019-03-08T12:46:46Z
dc.date.issued 2019-04
dc.identifier.uri http://hdl.handle.net/11234/1-2943
dc.description OAGK is a keyword extraction/generation dataset consisting of 2.2 million abstracts, titles and keyword strings from scientific articles. Texts were lowercased and tokenized with Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file. This data is derived from OAG data collection (https://aminer.org/open-academic-graph) which was released under ODC-BY licence. This data (OAGK Keyword Generation Dataset) is released under CC-BY licence (https://creativecommons.org/licenses/by/4.0/). If using it, please cite the following paper: Çano, Erion and Bojar, Ondřej, 2019, Keyphrase Generation: A Text Summarization Struggle, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics, June 2019, Minneapolis, USA
dc.language.iso eng
dc.publisher Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.relation info:eu-repo/grantAgreement/EC/H2020/825460
dc.relation.isreferencedby https://www.aclweb.org/anthology/N19-1070
dc.relation.isreplacedby http://hdl.handle.net/11234/1-3062
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri http://creativecommons.org/licenses/by/4.0/
dc.subject keyword extraction
dc.subject supervised keyword generation
dc.title OAGK Keyword Generation Dataset
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
dc.rights.label PUB
has.files yes
branding LINDAT / CLARIAH-CZ
contact.person Erion Çano cano@ufal.mff.cuni.cz Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
sponsor Ministerstvo školství, mládeže a tělovýchovy České republiky CZ.02.2.69/0.0/0.0/16_027/0008495 OP VVV Mezinárodní mobilita výzkumných pracovníků Univerzity Karlovy nationalFunds
sponsor European Union H2020-ICT-2018-2-825460 ELITR - European Live Translator euFunds info:eu-repo/grantAgreement/EC/H2020/825460
size.info 2200000 entries
size.info 3 files
size.info 3.24 gb
files.size 1086288473
files.count 2


 Files in this item

This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name README.txt
Size 2.25 KB
Format Text file
Description readme
MD5 dc3560f8786a522c21ea96c4fc2f5c04
 File Preview  
OAGK Keyword Generation Dataset
===============================

OAGK is a keyword extraction/generation dataset consisting of 2.2 million
abstracts, titles and keyword strings from scientific articles. 
Texts were lowercased and tokenized with Stanford CoreNLP tokenizer. 
No other preprocessing steps were applied in this release version.
Dataset records (samples) are stored as JSON lines in each text file. 
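
Since each line of a data file holds one JSON record, the files can be streamed line by line instead of being loaded whole. The following is a minimal sketch in Python; the helper names `iter_records` and `peek` are hypothetical, and the field names inside each record are not stated in this README, so the sketch only prints whatever keys it finds.

```python
import json
from typing import Dict, Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-empty line (JSON lines layout)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        yield json.loads(line)


def peek(path: str, limit: int = 3) -> None:
    """Print the keys of the first few records without reading the whole file."""
    with open(path, encoding="utf-8") as handle:
        for index, record in enumerate(iter_records(handle)):
            if index >= limit:
                break
            # Field names are not documented here, so just show what is present.
            print(sorted(record.keys()))
```

Calling `peek` on one of the extracted text files (the path depends on where the archive was unpacked) is a cheap way to discover the record layout before writing any further processing.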

This data is derived from OAG data collection 
(https://aminer.org/open-academic-graph) which was released under 
ODC-BY licence. 

This data (OAGK Keyword Generation Dataset) is released under CC-BY licence 
(https://creativecommons.org/licenses/by/4.0/).


Download
--------

This dataset can be downloaded from the LINDAT/CLARIN repository
http://hdl.handle.net/11234/1-2943


Publications
------------

If using it, please cite the following paper:

Çano, Erion and Bojar, Ondřej, 2019, Keyphrase Generation: A Text 
Summarization Struggle, 2019 Annual Conf . . .
                                            
Name OAGK.zip
Size 1.01 GB
Format application/zip
Description data
MD5 92b0d028cde15184add0981349baccb4
 File Preview  
  • OAGK
    • oagk_train.txt (2 GB)
    • oagk_val.txt (141 MB)
    • oagk_test.txt (239 MB)
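
The MD5 values listed for README.txt and OAGK.zip can be used to check that a download is intact. The following is a minimal sketch in Python using the standard `hashlib` module; the helper names `md5_of_file` and `matches_listing` are hypothetical, and the local paths are assumed to be wherever the files were saved.

```python
import hashlib

# MD5 values copied from the file listing in this record.
EXPECTED_MD5 = {
    "README.txt": "dc3560f8786a522c21ea96c4fc2f5c04",
    "OAGK.zip": "92b0d028cde15184add0981349baccb4",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path: str, name: str) -> bool:
    """Compare a local file's digest with the value listed for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

Reading in chunks keeps memory use flat for the roughly 1 GB archive. For example, `matches_listing("OAGK.zip", "OAGK.zip")` would return `True` for an intact copy saved in the current directory.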
