OAGL Paper Metadata Dataset
Please use the following text to cite this item:
Çano, Erion, 2020,
OAGL Paper Metadata Dataset, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11234/1-3257.
Date issued
2020-06-30
Size
5 files, 17528680 entries, 22.9 GB, 7.3 GB
Description
OAGL is a paper metadata dataset consisting of 17528680 records that comprise various scientific publication attributes such as abstracts, titles, keywords, publication years, and venues. The last field of each record is the page length of the corresponding publication. Dataset records (samples) are stored as JSON lines in each text file. The data is derived from the OAG data collection (https://aminer.org/open-academic-graph), which was released under the ODC-BY license. This data (OAGL Paper Metadata Dataset) is released under the CC-BY license (https://creativecommons.org/licenses/by/4.0/).
If using it, please cite the following paper:
Çano Erion, Bojar Ondřej: How Many Pages? Paper Length Prediction from the Metadata.
NLPIR 2020, Proceedings of the 4th International Conference on Natural Language
Processing and Information Retrieval, Seoul, Korea, December 2020.
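Because records are stored as JSON lines and the page length is the last field of each record, a record can be read without knowing its field names. The following is a minimal sketch, assuming each non-empty line is a single JSON object. The helper names `iter_records` and `page_length` and the field names in the sample lines are hypothetical and are not taken from the dataset.

```python
import json


def iter_records(lines):
    """Yield one parsed record per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def page_length(record):
    """Return the last field's value, which is assumed to be the page length."""
    # json.loads keeps keys in the order they appear in the text,
    # so the last value corresponds to the last field of the record.
    return list(record.values())[-1]


# Made-up sample lines for illustration only; real field names may differ.
sample_lines = [
    '{"title": "A made-up title", "year": 2019, "pages": 8}',
    '',
    '{"title": "Another made-up title", "year": 2018, "pages": 12}',
]

lengths = [page_length(record) for record in iter_records(sample_lines)]
print(lengths)
```

To read a real split, an open file handle can be passed in place of `sample_lines`, for example `open("oagl/val.txt", encoding="utf-8")`. The path and the UTF-8 encoding are assumptions based on the file listing, not documented facts.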
Acknowledgement
European Union
Project code: H2020-ICT-2018-2-825460
Project name: ELITR - European Live Translator
This item is publicly available and licensed under CC-BY (https://creativecommons.org/licenses/by/4.0/).
Files in this item
- Name: README.txt
- Size: 1.56 KB
- Format: text/plain
- Description: readme
- MD5: 8442f638fbb2ab4d45c6c28a846b70b5

OAGL Paper Metadata Dataset
===========================

OAGL is a paper metadata dataset consisting of 17528680 records that comprise various scientific publication attributes such as abstracts, titles, keywords, publication years, and venues. The last field of each record is the page length of the corresponding publication. Dataset records (samples) are stored as JSON lines in each text file. The data is derived from the OAG data collection (https://aminer.org/open-academic-graph), which was released under the ODC-BY license. This data (OAGL Paper Metadata Dataset) is released under the CC-BY license (https://creativecommons.org/licenses/by/4.0/).

Download
--------

This dataset can be downloaded from: http://hdl.handle.net/11234/1-3257

Publications
------------

If using it, please cite the following paper:

Çano Erion, Bojar Ondřej: How Many Pages? Paper Length Prediction from the Metadata. NLPIR 2020, Proceedings of the 4th International Conference on Natural Language Processing and Information Retrieval, Seoul, Korea, December 2020.

Acknowledgements
----------------

This research work was supported by the project no. 19-26934X (NEUREM3) of the Czech Science Foundation and ELITR (H2020-ICT-2018-2-825460) of the EU.

Statistics of OAGL:
-------------------

Total samples: 17528680

Title tokens*      mean: 11.96    std: 4.49
Abstract tokens*   mean: 144.86   std: 74.98
Keywords           mean: 6.74     std: 5.49
Page length        mean: 6.65     std: 4.87

*These values may vary depending on how text processing is done.
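The README notes that the token statistics depend on how the text is processed. As a rough illustration of how such a mean and standard deviation could be computed, the sketch below assumes plain whitespace tokenization and a population standard deviation. Neither choice is stated in the README, and the function name `token_count_stats` is hypothetical, so the results need not match the reported figures.

```python
import statistics


def token_count_stats(texts):
    """Return (mean, population std) of whitespace token counts."""
    # Assumption: tokens are separated by whitespace; the README does not
    # say which tokenizer produced the reported statistics.
    counts = [len(text.split()) for text in texts]
    return statistics.mean(counts), statistics.pstdev(counts)


# Made-up titles for illustration only.
titles = ["a made-up title", "another slightly longer made-up title"]
mean_tokens, std_tokens = token_count_stats(titles)
print(mean_tokens, std_tokens)
```

Swapping `statistics.pstdev` for `statistics.stdev` gives the sample standard deviation instead. On a dataset of this size the difference is negligible.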
- Name: oagl.zip
- Size: 7.28 GB
- Format: application/zip
- Description: Data
- MD5: e2d6dfc1a6d7c76499e4c1c27ad86a89
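The MD5 values listed for each file can be used to check a download for corruption. The following is a minimal sketch using Python's standard `hashlib` module. The function name `md5_of_file` and the local path in the usage note are assumptions. The file is read in chunks so that a multi-gigabyte archive does not have to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Assuming the archive was saved as `oagl.zip` in the current directory, `md5_of_file("oagl.zip")` would be expected to return `e2d6dfc1a6d7c76499e4c1c27ad86a89` for an intact download.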

- oagl
  - val.txt (829 kB)
  - test.txt (1 MB)
  - val-test_bck.txt (274 MB)
  - train_bck.txt (22 GB)
  - train.txt (5 MB)

