Files in this item

License category:
Publicly Available

Licence: Creative Commons - Attribution 4.0 International (CC BY 4.0)
Name: oagsx.zip
Size: 12.45 GB
Format: application/zip
Description: data
MD5: f37926dece4c79b832ecaafce6ba1f28

File preview:
  • oagsx
    • part000.txt (1 GB)
    • part001.txt (1 GB)
    • part002.txt (1 GB)
    • part003.txt (1 GB)
    • part004.txt (1 GB)
    • part005.txt (1 GB)
    • part006.txt (1 GB)
    • part007.txt (1 GB)
    • part008.txt (1 GB)
    • part009.txt (1 GB)
    • part010.txt (1 GB)
    • part011.txt (1 GB)
    • part012.txt (1 GB)
    • part013.txt (1 GB)
    • part014.txt (1 GB)
    • part015.txt (1 GB)
    • part016.txt (1 GB)
    • part017.txt (1 GB)
    • part018.txt (1 GB)
    • part019.txt (950 MB)
    • part020.txt (1 GB)
    • part021.txt (1 GB)
    • part022.txt (1 GB)
    • part023.txt (1 GB)
    • part024.txt (1 GB)
    • part025.txt (1 GB)
    • part026.txt (1 GB)
    • part027.txt (1 GB)
    • part028.txt (1 GB)
    • part029.txt (1 GB)
    • part030.txt (1 GB)
    • part031.txt (875 MB)
    • part032.txt (873 MB)
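
The MD5 values listed for the files in this record can be used to check that a download completed without corruption. The following is a minimal sketch, not part of the original record: the function names `md5_of_file` and `verify_downloads` are hypothetical, and it assumes the files were saved under their listed names in one local directory.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Fixed-size chunks keep memory use flat even for a 12 GB archive.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Checksums copied from this record's file listing.
EXPECTED_MD5 = {
    "oagsx.zip": "f37926dece4c79b832ecaafce6ba1f28",
    "README.txt": "f8c484dee332fd01753a32507d07825e",
}


def verify_downloads(directory="."):
    """Map each listed file to True/False (checksum match) or None (file missing)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name  # assumed local location of the download
        results[name] = md5_of_file(path) == expected if path.is_file() else None
    return results


if __name__ == "__main__":
    for name, ok in verify_downloads().items():
        status = "missing" if ok is None else ("OK" if ok else "MISMATCH")
        print(f"{name}: {status}")
```

A mismatch usually indicates an interrupted or corrupted transfer, in which case downloading the file again is the simplest remedy.
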
Name: README.txt
Size: 1.51 KB
Format: Text file
Description: readme (updated on 2020-06-02)
MD5: f8c484dee332fd01753a32507d07825e

File preview:
OAGSX Title Generation Dataset
==============================

OAGSX is a title generation dataset consisting
of 34,408,509 abstracts and titles from scientific
articles. The texts were lowercased and tokenized with
the Stanford CoreNLP tokenizer. No other preprocessing steps
were applied in this release. Dataset records
(samples) are stored as JSON lines in each text file.
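
Because each part file stores one JSON record per line, the data can be streamed without loading a whole file into memory. The sketch below is an illustration under stated assumptions, not code shipped with the dataset: the helper names are hypothetical, UTF-8 encoding is assumed, each record is assumed to be a JSON object, and the member path `oagsx/part000.txt` is inferred from the archive preview. The field names are not documented in this README, so the example prints the keys of the first record rather than guessing them.

```python
import io
import json
import zipfile
from pathlib import Path


def iter_records(lines):
    """Yield one parsed JSON value per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def iter_part_file(path):
    """Stream records from one extracted part file."""
    with open(path, encoding="utf-8") as handle:  # UTF-8 is an assumption
        yield from iter_records(handle)


def iter_zip_member(zip_path, member):
    """Stream records from one archive member without extracting it."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            yield from iter_records(io.TextIOWrapper(raw, encoding="utf-8"))


if __name__ == "__main__":
    archive_path = Path("oagsx.zip")  # assumed local download location
    if archive_path.is_file():
        # Member path inferred from the archive preview; adjust if it differs.
        first = next(iter_zip_member(archive_path, "oagsx/part000.txt"))
        # Inspect the actual field names instead of assuming them.
        print(sorted(first.keys()))
```

Reading directly from the archive avoids unpacking roughly 33 GB of text when only a sample of the records is needed.
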

The data is derived from the OAG data collection
(https://aminer.org/open-academic-graph), which was released
under the ODC-BY license.

This data (OAGSX Title Generation Dataset) is released under
the CC-BY license (https://creativecommons.org/licenses/by/4.0/).


Download
--------

This dataset can be downloaded from the LINDAT/CLARIN repository:
http://hdl.handle.net/11234/1-3079


Publications
------------

If you use this dataset, please cite the following paper:

Çano Erion, Bojar Ondřej. Two Huge Title and Keyword Generation Corpora of Research Articles. 
LREC 2020, Proceedings of the 12th In . . .