dc.contributor.author Çano, Erion
dc.date.accessioned 2019-10-31T09:04:42Z
dc.date.available 2019-10-31T09:04:42Z
dc.date.issued 2019-11-01
dc.identifier.uri http://hdl.handle.net/11234/1-3079
dc.description OAGSX is a title generation dataset consisting of 34408509 abstracts and titles from scientific articles. The texts were lowercased and tokenized with the Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file. The data is derived from the OAG data collection (https://aminer.org/open-academic-graph), which was released under the ODC-BY license. This data (OAGSX Title Generation Dataset) is released under the CC-BY license (https://creativecommons.org/licenses/by/4.0/). If you use it, please consider also citing the following paper: Çano Erion, Bojar Ondřej. Two Huge Title and Keyword Generation Corpora of Research Articles. LREC 2020, Proceedings of the 12th International Conference on Language Resources and Evaluation, Marseille, France, May 2020.
dc.language.iso eng
dc.publisher Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.relation info:eu-repo/grantAgreement/EC/H2020/825460
dc.relation.isreferencedby https://www.aclweb.org/anthology/2020.lrec-1.823
dc.relation.replaces http://hdl.handle.net/11234/1-3043
dc.rights Creative Commons - Attribution 4.0 International (CC BY 4.0)
dc.rights.uri http://creativecommons.org/licenses/by/4.0/
dc.subject Title Generation Dataset
dc.subject Abstractive Text Summarization
dc.subject Scientific Papers Corpus
dc.title OAGSX Title Generation Dataset
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
dc.rights.label PUB
has.files yes
branding LINDAT / CLARIAH-CZ
contact.person Erion Çano cano@ufal.mff.cuni.cz Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
sponsor Ministerstvo školství, mládeže a tělovýchovy České republiky CZ.02.2.69/0.0/0.0/16_027/0008495 OP VVV Mezinárodní mobilita výzkumných pracovníků Univerzity Karlovy nationalFunds
sponsor European Union H2020-ICT-2018-2-825460 ELITR - European Live Translator euFunds info:eu-repo/grantAgreement/EC/H2020/825460
size.info 33 files
size.info 38.8 gb
size.info 34408509 entries
size.info 12.4 gb
files.size 13363566273
files.count 2


 Files in this item

This item is publicly available and licensed under:
Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name: oagsx.zip
Size: 12.45 GB
Format: application/zip
Description: data
MD5: f37926dece4c79b832ecaafce6ba1f28
File preview:
  • oagsx
    • part005.txt (1 GB)
    • part023.txt (1 GB)
    • part010.txt (1 GB)
    • part028.txt (1 GB)
    • part015.txt (1 GB)
    • part002.txt (1 GB)
    • part020.txt (1 GB)
    • part007.txt (1 GB)
    • part025.txt (1 GB)
    • part012.txt (1 GB)
    • part030.txt (1 GB)
    • part017.txt (1 GB)
    • part004.txt (1 GB)
    • part022.txt (1 GB)
    • part009.txt (1 GB)
    • part027.txt (1 GB)
    • part014.txt (1 GB)
    • part032.txt (873 MB)
    • part001.txt (1 GB)
    • part019.txt (950 MB)
    • part006.txt (1 GB)
    • part024.txt (1 GB)
    • part011.txt (1 GB)
    • part029.txt (1 GB)
    • part016.txt (1 GB)
    • part003.txt (1 GB)
    • part021.txt (1 GB)
    • part008.txt (1 GB)
    • part026.txt (1 GB)
    • part013.txt (1 GB)
    • part031.txt (875 MB)
    • part000.txt (1 GB)
    • part018.txt (1 GB)
Name: README.txt
Size: 1.51 KB
Format: Text file
Description: readme (updated on 2020-06-02)
MD5: f8c484dee332fd01753a32507d07825e
File preview:
OAGSX Title Generation Dataset
==============================

OAGSX is a title generation dataset consisting
of 34408509 abstracts and titles from scientific
articles. The texts were lowercased and tokenized with
the Stanford CoreNLP tokenizer. No other preprocessing steps
were applied in this release version. Dataset records
(samples) are stored as JSON lines in each text file.
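
Since each text file stores one JSON record per line, the files can be streamed without loading a whole 1 GB part into memory. The following is a minimal sketch in Python, not an official loader. The field names "title" and "abstract" are assumptions made for illustration; inspect a real record to confirm the actual keys.

```python
import io
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines, if any
            yield json.loads(line)


# In-memory stand-in for a part file. The keys "title" and "abstract"
# are hypothetical and may differ from the real dataset.
sample = io.StringIO(
    '{"title": "a sample title", "abstract": "a sample abstract ."}\n'
    '{"title": "another title", "abstract": "another abstract ."}\n'
)
records = list(iter_records(sample))
```

For a real file, the same function can be applied to an open file handle, for example open("oagsx/part000.txt", encoding="utf-8"), processing records one at a time rather than collecting them in a list.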

The data is derived from the OAG data collection
(https://aminer.org/open-academic-graph), which was released
under the ODC-BY license.

This data (OAGSX Title Generation Dataset) is released under
the CC-BY license (https://creativecommons.org/licenses/by/4.0/).


Download
--------

This dataset can be downloaded from the LINDAT/CLARIN repository:
http://hdl.handle.net/11234/1-3079
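
Because the archive is over 12 GB, it may be worth checking the download against the MD5 checksum that the repository record lists for oagsx.zip (f37926dece4c79b832ecaafce6ba1f28). The sketch below shows one way to do this in Python; the function names and the local file path are illustrative assumptions, not part of the dataset.

```python
import hashlib

# MD5 listed for oagsx.zip in the repository record.
EXPECTED_MD5 = "f37926dece4c79b832ecaafce6ba1f28"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file's MD5 equals the expected hex digest."""
    return md5_of_file(path) == expected.lower()


# Hypothetical usage, assuming the archive was saved as "oagsx.zip":
# print(matches_checksum("oagsx.zip"))
```

Reading in fixed-size chunks keeps memory use constant regardless of the archive size.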


Publications
------------

If using it, please cite the following paper:

Çano Erion, Bojar Ondřej. Two Huge Title and Keyword Generation Corpora of Research Articles.
LREC 2020, Proceedings of the 12th In . . .
                                            
