dc.contributor.author | Çano, Erion |
dc.date.accessioned | 2019-10-31T09:04:42Z |
dc.date.available | 2019-10-31T09:04:42Z |
dc.date.issued | 2019-11-01 |
dc.identifier.uri | http://hdl.handle.net/11234/1-3079 |
dc.description | OAGSX is a title generation dataset consisting of 34408509 abstracts and titles from scientific articles. The texts were lowercased and tokenized with the Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file. The data is derived from the OAG data collection (https://aminer.org/open-academic-graph), which was released under the ODC-BY license. This data (OAGSX Title Generation Dataset) is released under the CC-BY license (https://creativecommons.org/licenses/by/4.0/). If using it, please also consider citing the following paper: Çano Erion, Bojar Ondřej. Two Huge Title and Keyword Generation Corpora of Research Articles. LREC 2020, Proceedings of the 12th International Conference on Language Resources and Evaluation, Marseille, France, May 2020. |
dc.language.iso | eng |
dc.publisher | Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) |
dc.relation | info:eu-repo/grantAgreement/EC/H2020/825460 |
dc.relation.isreferencedby | https://www.aclweb.org/anthology/2020.lrec-1.823 |
dc.relation.replaces | http://hdl.handle.net/11234/1-3043 |
dc.rights | Creative Commons - Attribution 4.0 International (CC BY 4.0) |
dc.rights.uri | http://creativecommons.org/licenses/by/4.0/ |
dc.subject | Title Generation Dataset |
dc.subject | Abstractive Text Summarization |
dc.subject | Scientific Papers Corpus |
dc.title | OAGSX Title Generation Dataset |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
dc.rights.label | PUB |
has.files | yes |
branding | LINDAT / CLARIAH-CZ |
contact.person | Erion Çano cano@ufal.mff.cuni.cz Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) |
sponsor | Ministerstvo školství, mládeže a tělovýchovy České republiky CZ.02.2.69/0.0/0.0/16_027/0008495 OP VVV Mezinárodní mobilita výzkumných pracovníků Univerzity Karlovy nationalFunds |
sponsor | European Union H2020-ICT-2018-2-825460 ELITR - European Live Translator euFunds info:eu-repo/grantAgreement/EC/H2020/825460 |
size.info | 33 files |
size.info | 38.8 GB |
size.info | 34408509 entries |
size.info | 12.4 GB |
files.size | 13363566273 |
files.count | 2 |
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution 4.0 International (CC BY 4.0)
- Name: oagsx.zip
- Size: 12.45 GB
- Format: application/zip
- Description: data
- MD5: f37926dece4c79b832ecaafce6ba1f28
- oagsx
  - part000.txt (1 GB)
  - part001.txt (1 GB)
  - part002.txt (1 GB)
  - part003.txt (1 GB)
  - part004.txt (1 GB)
  - part005.txt (1 GB)
  - part006.txt (1 GB)
  - part007.txt (1 GB)
  - part008.txt (1 GB)
  - part009.txt (1 GB)
  - part010.txt (1 GB)
  - part011.txt (1 GB)
  - part012.txt (1 GB)
  - part013.txt (1 GB)
  - part014.txt (1 GB)
  - part015.txt (1 GB)
  - part016.txt (1 GB)
  - part017.txt (1 GB)
  - part018.txt (1 GB)
  - part019.txt (950 MB)
  - part020.txt (1 GB)
  - part021.txt (1 GB)
  - part022.txt (1 GB)
  - part023.txt (1 GB)
  - part024.txt (1 GB)
  - part025.txt (1 GB)
  - part026.txt (1 GB)
  - part027.txt (1 GB)
  - part028.txt (1 GB)
  - part029.txt (1 GB)
  - part030.txt (1 GB)
  - part031.txt (875 MB)
  - part032.txt (873 MB)
- Name: README.txt
- Size: 1.51 KB
- Format: Text file
- Description: readme (updated on 2020-06-02)
- MD5: f8c484dee332fd01753a32507d07825e
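The MD5 values listed for oagsx.zip and README.txt can be used to check a download for corruption. The following is a minimal sketch, assuming the files have been saved locally; the function names `md5_of_file` and `matches_listed_md5` are hypothetical helpers, not part of the dataset or the repository.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reading keeps memory use flat even for a multi-gigabyte archive.
    """
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path: Path, expected: str) -> bool:
    """Compare a file's MD5 with the value shown in the file listing."""
    return md5_of_file(path) == expected.strip().lower()
```

For example, `matches_listed_md5(Path("oagsx.zip"), "f37926dece4c79b832ecaafce6ba1f28")` should return `True` for an intact copy of the archive, where the local path `oagsx.zip` is an assumption about where the file was saved.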
OAGSX Title Generation Dataset
==============================

OAGSX is a title generation dataset consisting of 34408509 abstracts and titles from scientific articles. The texts were lowercased and tokenized with the Stanford CoreNLP tokenizer. No other preprocessing steps were applied in this release version. Dataset records (samples) are stored as JSON lines in each text file. The data is derived from the OAG data collection (https://aminer.org/open-academic-graph), which was released under the ODC-BY license. This data (OAGSX Title Generation Dataset) is released under the CC-BY license (https://creativecommons.org/licenses/by/4.0/).

Download
--------

This dataset can be downloaded from the LINDAT/CLARIN repository:
http://hdl.handle.net/11234/1-3079

Publications
------------

If using it, please cite the following paper:
Çano Erion, Bojar Ondřej. Two Huge Title and Keyword Generation Corpora of Research Articles. LREC 2020, Proceedings of the 12th In . . .
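
Since the records are stored as JSON lines, each part file can be streamed one record at a time instead of being loaded whole. The sketch below assumes an extracted part file such as `oagsx/part000.txt`; the helper names `iter_records` and `preview` are hypothetical, and the code makes no assumption about the field names, which this record does not list.

```python
import json
from itertools import islice
from pathlib import Path
from typing import Any, Dict, Iterator, List


def iter_records(path: Path) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a part file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)


def preview(path: Path, limit: int = 3) -> List[Dict[str, Any]]:
    """Return the first `limit` records without reading the rest of the file."""
    records = iter_records(path)
    try:
        return list(islice(records, limit))
    finally:
        records.close()  # closing the generator also closes the file
```

Calling `preview(Path("oagsx/part000.txt"))` and inspecting the keys of the returned dictionaries is a reasonable first step to confirm the actual field names before building a title generation pipeline on top of them.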