
OAGSX Title Generation Dataset

Please use the following text to cite this item:
Çano, Erion, 2019, OAGSX Title Generation Dataset, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11234/1-3079.
Date issued
2019-11-01
Size
33 files,
38.8 GB,
34408509 entries,
12.4 GB
Language(s)
Description
OAGSX is a title generation dataset consisting of 34408509 abstracts and titles from scientific articles. The texts were lowercased and tokenized with the Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file. The data is derived from the OAG data collection (https://aminer.org/open-academic-graph), which was released under the ODC-BY license. This data (OAGSX Title Generation Dataset) is released under the CC-BY license (https://creativecommons.org/licenses/by/4.0/). If you use it, please also consider citing the following paper: Çano Erion, Bojar Ondřej. Two Huge Title and Keyword Generation Corpora of Research Articles. LREC 2020, Proceedings of the 12th International Conference on Language Resources and Evaluation, Marseille, France, May 2020.
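Since the records are stored as JSON lines, one way to read a part file is to parse it one line at a time. The following is a minimal sketch, not the authors' loader. The description does not list the field names of a record, so the sketch does not assume any. The helper names `iter_records` and `peek_keys` are hypothetical, and it inspects the first record to find out which fields are present.

```python
import json
from typing import Iterator, List


def iter_records(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSON-lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines defensively
                yield json.loads(line)


def peek_keys(path: str) -> List[str]:
    """Return the sorted field names of the first record, or [] for an empty file."""
    for record in iter_records(path):
        return sorted(record)
    return []
```

For example, `peek_keys("oagsx/part000.txt")` would report the field names of the first record, assuming the archive has been extracted to an `oagsx` directory. Because `iter_records` is a generator, a part file of about 1 GB never has to be held in memory at once.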
Acknowledgement

Version History

Version  Date  Summary
2*
2019-11-01 00:00:00
2019-09-01 00:00:00
* Selected version
This item is Publicly Available
and licensed under CC-BY (https://creativecommons.org/licenses/by/4.0/).
 Files in this item
Name
oagsx.zip
Size
12.45 GB
Format
application/zip
Description
Zip
MD5
f37926dece4c79b832ecaafce6ba1f28
Preview
  • oagsx
    • part000.txt (1 GB)
    • part001.txt (1 GB)
    • part002.txt (1 GB)
    • part003.txt (1 GB)
    • part004.txt (1 GB)
    • part005.txt (1 GB)
    • part006.txt (1 GB)
    • part007.txt (1 GB)
    • part008.txt (1 GB)
    • part009.txt (1 GB)
    • part010.txt (1 GB)
    • part011.txt (1 GB)
    • part012.txt (1 GB)
    • part013.txt (1 GB)
    • part014.txt (1 GB)
    • part015.txt (1 GB)
    • part016.txt (1 GB)
    • part017.txt (1 GB)
    • part018.txt (1 GB)
    • part019.txt (950 MB)
    • part020.txt (1 GB)
    • part021.txt (1 GB)
    • part022.txt (1 GB)
    • part023.txt (1 GB)
    • part024.txt (1 GB)
    • part025.txt (1 GB)
    • part026.txt (1 GB)
    • part027.txt (1 GB)
    • part028.txt (1 GB)
    • part029.txt (1 GB)
    • part030.txt (1 GB)
    • part031.txt (875 MB)
    • part032.txt (873 MB)
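Because the archive is over 12 GB, it may be worth checking the download against the MD5 value listed for oagsx.zip before extracting it. The sketch below uses Python's standard `hashlib` and reads the file in chunks. The function names and the local file path are assumptions for illustration only.

```python
import hashlib

# MD5 value listed on this page for oagsx.zip
EXPECTED_MD5 = "f37926dece4c79b832ecaafce6ba1f28"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Return True if the file's MD5 digest equals the expected hex string."""
    return md5_of_file(path) == expected.lower()
```

Calling `matches_expected("oagsx.zip")` on a complete download should return `True`. A `False` result most likely points to a truncated or corrupted transfer.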
Name
README.txt
Size
1.51 KB
Format
text/plain
Description
Text
MD5
f8c484dee332fd01753a32507d07825e
Preview
    OAGSX Title Generation Dataset
    ==============================
    
    OAGSX is a title generation dataset consisting
    of 34408509 abstracts and titles from scientific 
    articles. The texts were lowercased and tokenized with the
    Stanford CoreNLP tokenizer. No other preprocessing steps
    were applied in this release version. Dataset records 
    (samples) are stored as JSON lines in each text file. 
    
    The data is derived from the OAG data collection 
    (https://aminer.org/open-academic-graph), which was released 
    under the ODC-BY license. 
    
    This data (OAGSX Title Generation Dataset) is released under 
    the CC-BY license (https://creativecommons.org/licenses/by/4.0/). 
    
    
    Download
    --------
    
    This dataset can be downloaded from the LINDAT/CLARIN repository:
    http://hdl.handle.net/11234/1-3079
    
    
    Publications
    ------------
    
    If using it, please cite the following paper:
    
    Çano Erion, Bojar Ondřej. Two Huge Title and Keyword Generation Corpora of Research Articles. 
    LREC 2020, Proceedings of the 12th International Conference on Language Resources and Evaluation, 
    Marseille, France, May 2020
    
    
    Acknowledgements
    ----------------
    
    This research work was [partially] supported by OP RDE project No. 
    CZ.02.2.69/0.0/0.0/16_027/0008495, International Mobility of 
    Researchers at Charles University.
    
    
    Statistics of OAGSX
    -------------------
    
    Total samples:     34408509
    Title tokens:      mean: 13.04   std: 5.13   min: 3   max: 25
    Abstract tokens:   mean: 182.19  std: 89.20  min: 50  max: 400
    Abs-Tit overlap:   mean: 0.7713  std: 0.1796
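The statistics above could be recomputed along the following lines. This is a sketch under stated assumptions, not the authors' script. The README does not define how the abstract-title overlap was measured, so it is assumed here to be the fraction of title tokens that also occur in the abstract. Tokens are obtained by splitting on whitespace, since the texts are already tokenized, and the standard deviation is taken as the population form. All function names are hypothetical.

```python
import statistics
from typing import Dict, Sequence


def token_count(text: str) -> int:
    """Count whitespace-separated tokens in an already tokenized text."""
    return len(text.split())


def title_overlap(title: str, abstract: str) -> float:
    """Assumed overlap measure: share of title tokens found in the abstract."""
    title_tokens = title.split()
    if not title_tokens:
        return 0.0
    abstract_vocab = set(abstract.split())
    hits = sum(1 for tok in title_tokens if tok in abstract_vocab)
    return hits / len(title_tokens)


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Return mean, population standard deviation, min and max of the values."""
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
    }
```

Applying `token_count` to every title and abstract and passing the resulting lists to `summarize` would give figures in the same layout as the table. Whether they match the published numbers exactly depends on the overlap definition and the form of standard deviation that were actually used.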