OAGL Paper Metadata Dataset
===========================

OAGL is a paper metadata dataset consisting of 17528680 records, each
comprising scientific publication attributes such as abstract, title,
keywords, publication year, and venue. The last field of each record is the
page length of the corresponding publication. Dataset records (samples) are
stored as JSON lines in each text file.

The data is derived from the OAG data collection
(https://aminer.org/open-academic-graph), which was released under the ODC-BY
license. This data (OAGL Paper Metadata Dataset) is released under the CC-BY
license (https://creativecommons.org/licenses/by/4.0/).

Download
--------

This dataset can be downloaded from: http://hdl.handle.net/11234/1-3257

Publications
------------

If you use this dataset, please cite the following paper:

Çano Erion, Bojar Ondřej: How Many Pages? Paper Length Prediction from the
Metadata. NLPIR 2020, Proceedings of the 4th International Conference on
Natural Language Processing and Information Retrieval, Seoul, Korea,
December 2020.

Acknowledgements
----------------

This research work was supported by the project no. 19-26934X (NEUREM3) of
the Czech Science Foundation and ELITR (H2020-ICT-2018-2-825460) of the EU.

Statistics of OAGL:
-------------------

Total samples:     17528680
Title tokens*      mean: 11.96     std: 4.49
Abstract tokens*   mean: 144.86    std: 74.98
Keywords           mean: 6.74      std: 5.49
Page length        mean: 6.65      std: 4.87

*These values may vary depending on how text processing is done.
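
Example: reading the records
----------------------------

Because the records are stored as JSON lines and the last field of each
record is the page length, the page length statistics can be recomputed with
a short script. The following is a minimal Python sketch, not the authors'
code. The function names are illustrative, and it assumes that the last field
holds a numeric value. It uses the population standard deviation, which may
or may not match how the figures above were computed.

```python
import json
import statistics
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def page_length(record: dict) -> float:
    """Return the value of the record's last field, taken as the page length."""
    # json.loads preserves key order, so the last value is the last field.
    # Assumption: that value is numeric (or a numeric string).
    return float(list(record.values())[-1])


def page_length_stats(lines: Iterable[str]) -> tuple[int, float, float]:
    """Return (count, mean, population std) of page lengths."""
    lengths = [page_length(r) for r in iter_records(lines)]
    return len(lengths), statistics.fmean(lengths), statistics.pstdev(lengths)


# Hypothetical usage; "oagl_part.txt" is a placeholder file name:
# with open("oagl_part.txt", encoding="utf-8") as f:
#     print(page_length_stats(f))
```

The sketch keeps all page lengths in memory, which is fine for a single file.
For the full dataset, accumulating running sums would be more economical.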