
OAGL Paper Metadata Dataset

Please use the following text to cite this item:
Çano, Erion, 2020, OAGL Paper Metadata Dataset, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11234/1-3257.
Date issued
2020-06-30
Size
5 files, 17528680 entries, 22.9 GB, 7.3 GB
Description
OAGL is a paper metadata dataset consisting of 17528680 records, each comprising scientific publication attributes such as abstract, title, keywords, publication year, and venue. The last field of each record is the page length of the corresponding publication. Dataset records (samples) are stored as JSON lines in each text file. The data is derived from the OAG data collection (https://aminer.org/open-academic-graph), which was released under the ODC-BY license. This data (OAGL Paper Metadata Dataset) is released under the CC-BY license (https://creativecommons.org/licenses/by/4.0/). If you use it, please cite the following paper: Çano Erion, Bojar Ondřej: How Many Pages? Paper Length Prediction from the Metadata. NLPIR 2020, Proceedings of the 4th International Conference on Natural Language Processing and Information Retrieval, Seoul, Korea, December 2020.
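Since the records are stored as JSON lines with the page length as the last field, reading them could look roughly like the sketch below. It assumes that each line is a JSON object (not an array); the helper names `iter_records` and `split_target` are illustrative and not part of the dataset, and no field names are assumed because the description does not list them.

```python
import json
from typing import Any, Dict, Iterator, Tuple


def iter_records(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def split_target(record: Dict[str, Any]) -> Tuple[Dict[str, Any], Any]:
    """Separate the last field (the page length, per the description) from the rest.

    Assumption: the record is a JSON object; json.loads keeps the key order
    of the JSON text, so the last key is the last field of the record.
    """
    *feature_keys, target_key = record
    features = {key: record[key] for key in feature_keys}
    return features, record[target_key]
```

For example, `for record in iter_records("oagl/train.txt"):` would stream the training file one record at a time instead of loading it into memory, which matters for the larger files in the archive.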
This item is publicly available and licensed under CC-BY (https://creativecommons.org/licenses/by/4.0/).
Files in this item
Name
README.txt
Size
1.56 KB
Format
text/plain
Description
readme
MD5
8442f638fbb2ab4d45c6c28a846b70b5
Preview
  File Preview
    OAGL Paper Metadata Dataset
    ===========================
    
    OAGL is a paper metadata dataset consisting
    of 17528680 records which comprise various scientific 
    publication attributes like abstracts, titles, keywords,
    publication years, venues, etc. The last field of each
    record is the page length of the corresponding publication. 
    Dataset records (samples) are stored as JSON lines in each 
    text file. 
    
    The data is derived from OAG data collection 
    (https://aminer.org/open-academic-graph) which was released 
    under ODC-BY license. 
    
    This data (OAGL Paper Metadata Dataset) is released under 
    CC-BY license (https://creativecommons.org/licenses/by/4.0/). 
    
    
    Download
    --------
    
    This dataset can be downloaded from:
    http://hdl.handle.net/11234/1-3257
    
    
    Publications
    ------------
    
    If using it, please cite the following paper:
    
    Çano Erion, Bojar Ondřej: How Many Pages? Paper Length Prediction from the Metadata. 
    NLPIR 2020, Proceedings of the 4th International Conference on Natural Language 
    Processing and Information Retrieval, Seoul, Korea, December 2020.
    
    
    Acknowledgements
    ----------------
    
    This research work was supported by the project no. 19-26934X
    (NEUREM3) of the Czech Science Foundation and ELITR
    (H2020-ICT-2018-2-825460) of the EU.
    
    
    Statistics of OAGL:
    -------------------
    
    Total samples:      17528680
    Title tokens*       mean: 11.96     std: 4.49
    Abstract tokens*    mean: 144.86    std: 74.98
    Keywords            mean: 6.74      std: 5.49
    Page length         mean: 6.65      std: 4.87
    
    *These values may vary depending on how text processing is done.
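To illustrate why the token statistics depend on text processing, here is a minimal sketch that computes the mean and standard deviation of token counts using plain whitespace splitting. The tokenization rule and the choice of population standard deviation are assumptions; the README does not say which tokenizer or estimator produced the figures above, so results from this sketch need not match them. The function name `token_length_stats` is hypothetical.

```python
import math
from typing import Iterable, Tuple


def token_length_stats(texts: Iterable[str]) -> Tuple[float, float]:
    """Return (mean, population std) of whitespace-token counts.

    Uses running sums so that millions of texts can be streamed
    without keeping the individual counts in memory.
    """
    n = 0
    total = 0
    total_sq = 0
    for text in texts:
        count = len(text.split())  # assumption: tokens are whitespace-separated
        n += 1
        total += count
        total_sq += count * count
    if n == 0:
        raise ValueError("no texts given")
    mean = total / n
    variance = total_sq / n - mean * mean
    return mean, math.sqrt(max(variance, 0.0))
```

Swapping `text.split()` for a different tokenizer (for instance one that separates punctuation) would shift both values, which is the variation the footnote refers to.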
    
Name
oagl.zip
Size
7.28 GB
Format
application/zip
Description
Data
MD5
e2d6dfc1a6d7c76499e4c1c27ad86a89
Preview
  File Preview
  • oagl
    • val.txt (829 kB)
    • test.txt (1 MB)
    • val-test_bck.txt (274 MB)
    • train_bck.txt (22 GB)
    • train.txt (5 MB)
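The MD5 value listed for oagl.zip can be used to check that a 7.28 GB download arrived intact. The sketch below hashes the file in chunks; the local path `oagl.zip` and the helper name `md5_of_file` are assumptions for illustration.

```python
import hashlib

# MD5 listed on this page for oagl.zip
EXPECTED_MD5 = "e2d6dfc1a6d7c76499e4c1c27ad86a89"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage, assuming the archive was saved as "oagl.zip" in the working directory:
#     if md5_of_file("oagl.zip") != EXPECTED_MD5:
#         raise SystemExit("checksum mismatch, download again")
```

Reading in chunks keeps memory use constant regardless of archive size.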