OAGSX Title Generation Dataset
Please use the following text to cite this item:
Çano, Erion, 2019,
OAGSX Title Generation Dataset, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11234/1-3079.
Date issued
2019-11-01
Size
33 files,
38.8 GB,
34408509 entries,
12.4 GB
Description
OAGSX is a title generation dataset consisting of 34408509 abstracts and titles from scientific articles. The texts were lowercased and tokenized with the Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines, one record per line, in each text file.
The data is derived from the OAG data collection (https://aminer.org/open-academic-graph), which was released under the ODC-BY license.
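Since the records are stored as JSON lines, they can be streamed one at a time rather than loaded whole. The following is a minimal sketch, assuming the archive has been extracted into a local directory containing the `part*.txt` files; the function name `iter_records` and the directory argument are illustrative and not part of the dataset. The field names inside each record are not documented here, so the sketch does not assume any.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_records(data_dir: str) -> Iterator[dict]:
    """Yield one dict per JSON line from every part*.txt file in data_dir.

    Assumption: data_dir is wherever the archive was extracted locally.
    """
    # Sort so the parts are visited in a stable order (part000, part001, ...).
    for path in sorted(Path(data_dir).glob("part*.txt")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines defensively
                    yield json.loads(line)
```

Calling `next(iter_records("oagsx"))` and inspecting the keys of the returned dict is a quick way to discover the actual field names before writing any further processing.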
This data (OAGSX Title Generation Dataset) is released under CC-BY license (https://creativecommons.org/licenses/by/4.0/).
If you use it, please also consider citing the following paper:
Çano Erion, Bojar Ondřej. Two Huge Title and Keyword Generation Corpora of Research Articles.
LREC 2020, Proceedings of the 12th International Conference on Language Resources and Evaluation,
Marseille, France, May 2020.
Acknowledgement
Ministry of Education, Youth and Sports of the Czech Republic
Project code: CZ.02.2.69/0.0/0.0/16_027/0008495
Project name: OP RDE International Mobility of Researchers at Charles University
European Union
Project code: H2020-ICT-2018-2-825460
Project name: ELITR - European Live Translator
This item is publicly available and licensed under CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/).
Files in this item
- Name: oagsx.zip
- Size: 12.45 GB
- Format: application/zip
- Description: Zip
- MD5: f37926dece4c79b832ecaafce6ba1f28

- oagsx
  - part005.txt (1 GB)
  - part023.txt (1 GB)
  - part010.txt (1 GB)
  - part028.txt (1 GB)
  - part015.txt (1 GB)
  - part002.txt (1 GB)
  - part020.txt (1 GB)
  - part007.txt (1 GB)
  - part025.txt (1 GB)
  - part012.txt (1 GB)
  - part030.txt (1 GB)
  - part017.txt (1 GB)
  - part004.txt (1 GB)
  - part022.txt (1 GB)
  - part009.txt (1 GB)
  - part027.txt (1 GB)
  - part014.txt (1 GB)
  - part001.txt (1 GB)
  - part032.txt (873 MB)
  - part019.txt (950 MB)
  - part006.txt (1 GB)
  - part024.txt (1 GB)
  - part011.txt (1 GB)
  - part029.txt (1 GB)
  - part016.txt (1 GB)
  - part003.txt (1 GB)
  - part021.txt (1 GB)
  - part008.txt (1 GB)
  - part026.txt (1 GB)
  - part013.txt (1 GB)
  - part000.txt (1 GB)
  - part031.txt (875 MB)
  - part018.txt (1 GB)
- Name: README.txt
- Size: 1.51 KB
- Format: text/plain
- Description: Text
- MD5: f8c484dee332fd01753a32507d07825e

OAGSX Title Generation Dataset
==============================

OAGSX is a title generation dataset consisting of 34408509 abstracts and titles from scientific articles. The texts were lowercased and tokenized with the Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines, one record per line, in each text file.

The data is derived from the OAG data collection (https://aminer.org/open-academic-graph), which was released under the ODC-BY license.

This data (OAGSX Title Generation Dataset) is released under CC-BY license (https://creativecommons.org/licenses/by/4.0/).

Download
--------
This dataset can be downloaded from the LINDAT/CLARIN repository:
http://hdl.handle.net/11234/1-3079

Publications
------------
If using it, please cite the following paper:
Çano Erion, Bojar Ondřej. Two Huge Title and Keyword Generation Corpora of Research Articles.
LREC 2020, Proceedings of the 12th International Conference on Language Resources and Evaluation,
Marseille, France, May 2020.

Acknowledgements
----------------
This research work was [partially] supported by OP RDE project No. CZ.02.2.69/0.0/0.0/16_027/0008495, International Mobility of Researchers at Charles University.

Statistics of OAGSX:
--------------------
Total samples: 34408509
Title tokens     mean: 13.04   std: 5.13   min: 3   max: 25
Abstract tokens  mean: 182.19  std: 89.20  min: 50  max: 400
Abs-Tit overlap  mean: 0.7713  std: 0.1796
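Statistics of this kind can be recomputed on a sample of the data. The sketch below is hedged in several ways. It assumes tokens are separated by whitespace, which is plausible because the texts are already tokenized. It assumes the population standard deviation, although the README does not say which variant was used. It assumes that "Abs-Tit overlap" means the fraction of distinct title tokens that also occur in the abstract, which is a guess at the definition rather than the authors' documented method. The function names are illustrative.

```python
import statistics
from typing import Iterable, Tuple


def token_stats(texts: Iterable[str]) -> Tuple[float, float, int, int]:
    """Return (mean, std, min, max) of whitespace token counts.

    Assumption: population standard deviation (statistics.pstdev).
    """
    counts = [len(text.split()) for text in texts]
    return (
        statistics.mean(counts),
        statistics.pstdev(counts),
        min(counts),
        max(counts),
    )


def title_overlap(title: str, abstract: str) -> float:
    """Fraction of distinct title tokens that also appear in the abstract.

    Assumption: this is one possible reading of "Abs-Tit overlap".
    """
    title_tokens = set(title.split())
    if not title_tokens:
        return 0.0
    abstract_tokens = set(abstract.split())
    return len(title_tokens & abstract_tokens) / len(title_tokens)
```

If values computed this way on a large sample differ noticeably from the figures above, the most likely cause is that the overlap definition or the standard deviation variant assumed here differs from the one actually used.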

