
OAGS Title Generation Dataset

Please use the following text to cite this item:
Çano, Erion, 2019, OAGS Title Generation Dataset, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11234/1-3043.
Date issued
2019-09
Size
34993700 entries, 7 files, 46.8 GB, 14.8 GB
Description
OAGS is a title generation dataset consisting of 34993700 abstracts and titles from scientific articles. Texts were lowercased and tokenized with the Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines, one record per line, in each text file.

The data is derived from the OAG data collection (https://aminer.org/open-academic-graph), which was released under the ODC-BY licence. This data (OAGS Title Generation Dataset) is released under the CC-BY licence (https://creativecommons.org/licenses/by/4.0/).

If using it, please cite the following paper: Çano, Erion and Bojar, Ondřej, 2019, "Efficiency Metrics for Data-Driven Models: A Text Summarization Case Study", INLG 2019, The 12th International Conference on Natural Language Generation, November 2019, Tokyo, Japan.

To reproduce the experiments in the above paper, use the oags_train1.txt, oags_train2.txt, oags_train3.txt, oags_test.txt and oags_val.txt files. If you need more data samples, you can get them from oags_train_backup.txt and oags_val-test_backup.txt.
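Because each file holds one JSON record per line, it can be streamed without loading it into memory, which matters for the larger files. The following is a minimal sketch, not the authors' loading code. The function name `iter_records` is hypothetical, and the record field names are not documented on this page, so the sketch makes no assumption about them.

```python
import json
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


# Hypothetical usage: inspect the first record of a split to discover
# which keys the records actually contain.
#
#     first = next(iter_records("oags_val.txt"))
#     print(sorted(first.keys()))
```

Reading lazily with a generator keeps memory use flat regardless of file size, so the same function should work for the small validation file and for the backup files.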
This item is Publicly Available and licensed under CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/).
 Files in this item
Name
OAGS.zip
Size
14.89 GB
Format
application/zip
Description
Data
MD5
b3def7c79f11d2c109c48cc0a72b88ae
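The MD5 value listed above can be used to check that the archive downloaded intact. The sketch below is one way to do that with the Python standard library. The helper names `md5_of_file` and `matches_published_md5` are hypothetical, and the local path `OAGS.zip` in the usage comment assumes the archive was saved under its original name.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path: str, expected: str) -> bool:
    """Compare a file's digest with a published checksum, ignoring case."""
    return md5_of_file(path) == expected.strip().lower()


# Hypothetical usage with the checksum listed for OAGS.zip:
#
#     ok = matches_published_md5("OAGS.zip", "b3def7c79f11d2c109c48cc0a72b88ae")
```

Reading in chunks avoids holding the roughly 15 GB archive in memory. MD5 is adequate here as a download integrity check, not as a security guarantee.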
Preview
  File Preview
  • OAGS
    • oags_train3.txt (1 GB)
    • oags_val-test_backup.txt (657 MB)
    • oags_val.txt (14 MB)
    • oags_test.txt (14 MB)
    • oags_train2.txt (1 GB)
    • oags_train_backup.txt (42 GB)
    • oags_train1.txt (557 MB)
Name
README.txt
Size
1.82 KB
Format
text/plain
Description
Readme
MD5
dbea4cf9d8eba2dae318a74c1a9dc3f0
Preview
  File Preview
    OAGS Title Generation Dataset
    ===============================
    
    OAGS is a title generation dataset consisting of 34993700 abstracts 
    and titles from scientific articles. Texts were lowercased and 
    tokenized with the Stanford CoreNLP tokenizer. No other preprocessing
    steps were applied in this release version. Dataset records 
    (samples) are stored as JSON lines in each text file. 
    
    The data is derived from the OAG data collection 
    (https://aminer.org/open-academic-graph), which was released 
    under the ODC-BY licence. 
    
    This data (OAGS Title Generation Dataset) is released under the 
    CC-BY licence (https://creativecommons.org/licenses/by/4.0/). 
    
    
    Download
    --------
    
    This dataset can be downloaded from the LINDAT/CLARIN repository:
    http://hdl.handle.net/11234/1-3043
    
    
    Publications
    ------------
    
    If using it, please cite the following paper:
    
    Çano, Erion and Bojar, Ondřej, 2019, "Efficiency Metrics for 
    Data-Driven Models: A Text Summarization Case Study", INLG 2019, 
    The 12th International Conference on Natural Language Generation, 
    November 2019, Tokyo, Japan.
    
    To reproduce the experiments in the above paper, you can use 
    oags_train1.txt, oags_train2.txt, oags_train3.txt, oags_test.txt
    and oags_val.txt files. If you need more data samples you can get 
    them from oags_train_backup.txt and oags_val-test_backup.txt.
    
    
    Acknowledgements
    ----------------
    
    This research work was [partially] supported by OP RDE project No. 
    CZ.02.2.69/0.0/0.0/16_027/0008495, International Mobility of 
    Researchers at Charles University.
    
    
    Statistics of Full OAGS:
    ------------------------
    
    Total records: 34993700
    
    Titles:
      Total tokens: 479882266
      Min length: 1 tokens
      Max length: 821 tokens
      Avg length: 12.3 tokens
    
    Abstracts:
      Total tokens: 7861867117
      Min length: 1 tokens
      Max length: 321610 tokens
      Avg length: 189.7 tokens
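
Length statistics of this kind can be computed in a single pass over the texts. The sketch below assumes that whitespace splitting approximates token counts, since the texts are already tokenized. It is not the script that produced the figures above and may not reproduce them exactly. The function name `length_stats` and the output keys are hypothetical.

```python
from typing import Iterable, Optional


def length_stats(texts: Iterable[str]) -> dict:
    """Compute record count and total/min/max/avg whitespace-token lengths."""
    count = 0
    total = 0
    shortest: Optional[int] = None
    longest = 0
    for text in texts:
        n_tokens = len(text.split())  # split on any run of whitespace
        count += 1
        total += n_tokens
        shortest = n_tokens if shortest is None else min(shortest, n_tokens)
        longest = max(longest, n_tokens)
    return {
        "records": count,
        "total_tokens": total,
        "min": shortest,  # None when no texts were given
        "max": longest,
        "avg": total / count if count else 0.0,
    }
```

Because the function accepts any iterable of strings, it can be fed a generator that pulls one text field from each record, so the full dataset never has to fit in memory.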