
 
dc.contributor.author Çano, Erion
dc.date.accessioned 2019-09-12T10:47:34Z
dc.date.available 2019-09-12T10:47:34Z
dc.date.issued 2019-09
dc.identifier.uri http://hdl.handle.net/11234/1-3043
dc.description OAGS is a title generation dataset consisting of 34993700 abstracts and titles from scientific articles. Texts were lowercased and tokenized with Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file. The data is derived from OAG data collection (https://aminer.org/open-academic-graph) which was released under ODC-BY licence. This data (OAGS Title Generation Dataset) is released under CC-BY licence (https://creativecommons.org/licenses/by/4.0/). If using it, please cite the following paper: Çano, Erion and Bojar, Ondřej, 2019, "Efficiency Metrics for Data-Driven Models: A Text Summarization Case Study", INLG 2019, The 12th International Conference on Natural Language Generation, November 2019, Tokyo, Japan. To reproduce the experiments in the above paper, you can use oags_train1.txt, oags_train2.txt, oags_train3.txt, oags_test.txt and oags_val.txt files. If you need more data samples you can get them from oags_train_backup.txt and oags_val-test_backup.txt.
dc.language.iso eng
dc.publisher Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.relation info:eu-repo/grantAgreement/EC/H2020/825460
dc.relation.isreferencedby https://www.aclweb.org/anthology/W19-8630/
dc.relation.isreplacedby http://hdl.handle.net/11234/1-3079
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri http://creativecommons.org/licenses/by/4.0/
dc.subject Title Generation Dataset
dc.subject Abstractive Text Summarization
dc.subject Scientific Papers Corpus
dc.title OAGS Title Generation Dataset
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
dc.rights.label PUB
has.files yes
branding LINDAT / CLARIAH-CZ
contact.person Erion Çano cano@ufal.mff.cuni.cz Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
sponsor Ministerstvo školství, mládeže a tělovýchovy České republiky CZ.02.2.69/0.0/0.0/16_027/0008495 OP VVV Mezinárodní mobilita výzkumných pracovníků Univerzity Karlovy nationalFunds
sponsor European Union H2020-ICT-2018-2-825460 ELITR - European Live Translator euFunds info:eu-repo/grantAgreement/EC/H2020/825460
size.info 34993700 entries
size.info 7 files
size.info 46.8 gb
size.info 14.8 gb
files.size 15992457976
files.count 2


 Files in this item

This item is publicly available and licensed under:
Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name: README.txt
Size: 1.82 KB
Format: Text file
Description: Readme
MD5: dbea4cf9d8eba2dae318a74c1a9dc3f0

File preview:
OAGS Title Generation Dataset
===============================

OAGS is a title generation dataset consisting of 34993700 abstracts 
and titles from scientific articles. Texts were lowercased and 
tokenized with Stanford CoreNLP tokenizer. No other preprocessing
steps were applied in this release version. Dataset records 
(samples) are stored as JSON lines in each text file. 
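
Since each record is one JSON object per line, the files can be streamed without loading them fully into memory. The following is a minimal sketch, not the authors' own loader: the function names `iter_records` and `peek` are hypothetical, and no field names are assumed because the README does not list them.

```python
import json
from itertools import islice
from typing import Iterator, List


def iter_records(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSON-lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, if any
                yield json.loads(line)


def peek(path: str, n: int = 3) -> List[dict]:
    """Return the first n records, e.g. to inspect which keys a record has."""
    return list(islice(iter_records(path), n))
```

For instance, `peek("oags_val.txt")` would return the first three records of the validation file, assuming the archive has been extracted into the working directory. Inspecting the keys of those records is a safe way to find the actual field names before writing any training code.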

The data is derived from OAG data collection 
(https://aminer.org/open-academic-graph) which was released 
under ODC-BY licence. 

This data (OAGS Title Generation Dataset) is released under 
CC-BY licence (https://creativecommons.org/licenses/by/4.0/). 


Download
--------

This dataset can be downloaded from the LINDAT/CLARIN repository:
http://hdl.handle.net/11234/1-3043


Publications
------------

If using it, please cite the following paper:

Çano, Erion and Bojar, Ondřej, 2019, "Efficiency Metrics for 
Data-Driven Models: A Text Summarization Case Study", INLG 2019, 
The 12th Inter . . .
                                            
Name: OAGS.zip
Size: 14.89 GB
Format: application/zip
Description: Data
MD5: b3def7c79f11d2c109c48cc0a72b88ae

File preview:
  • OAGS
    • oags_train3.txt (1 GB)
    • oags_val.txt (14 MB)
    • oags_val-test_backup.txt (657 MB)
    • oags_train2.txt (1 GB)
    • oags_test.txt (14 MB)
    • oags_train_backup.txt (42 GB)
    • oags_train1.txt (557 MB)
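
Because the archive is large, it may be worth checking a download against the MD5 checksums listed for each file before extracting it. The sketch below is one possible way to do that with the Python standard library; the helper names `md5_of_file` and `verify` and the local file paths are assumptions for illustration, while the checksum values are those shown in this record.

```python
import hashlib

# Checksums as listed in this record.
EXPECTED_MD5 = {
    "README.txt": "dbea4cf9d8eba2dae318a74c1a9dc3f0",
    "OAGS.zip": "b3def7c79f11d2c109c48cc0a72b88ae",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_md5: str) -> bool:
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected_md5.lower()
```

A call such as `verify("OAGS.zip", EXPECTED_MD5["OAGS.zip"])` would return `True` for an intact download, assuming the file was saved under that name in the working directory. Reading in chunks keeps memory use constant regardless of file size.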
