dc.contributor.author | Náplava, Jakub |
dc.contributor.author | Straka, Milan |
dc.contributor.author | Straková, Jana |
dc.contributor.author | Rosen, Alexandr |
dc.date.accessioned | 2022-10-04T12:59:13Z |
dc.date.available | 2022-10-04T12:59:13Z |
dc.date.issued | 2022-01-17 |
dc.identifier.uri | http://hdl.handle.net/11234/1-4861 |
dc.description | The Grammar Error Correction Corpus for Czech (GECCC) consists of 83 058 sentences and covers four diverse domains: essays written by native students, informal website texts, essays written by Romani ethnic minority children and teenagers, and essays written by nonnative speakers. All domains are professionally annotated for GEC errors in a unified manner, and errors were automatically categorized with a Czech-specific version of ERRANT (https://github.com/ufal/errant_czech). The dataset was introduced in the paper "Czech Grammar Error Correction with a Large and Diverse Corpus", which was accepted to TACL. Until published in TACL, see the arXiv version (https://arxiv.org/pdf/2201.05590.pdf). This version fixes double annotation errors in the train and dev M2 files and also contains more metadata information. |
dc.language.iso | ces |
dc.publisher | Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) |
dc.relation.isreferencedby | https://arxiv.org/pdf/2201.05590.pdf |
dc.relation.isreferencedby | https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00470/110536 |
dc.relation.replaces | http://hdl.handle.net/11234/1-4639 |
dc.rights | Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) |
dc.rights.uri | http://creativecommons.org/licenses/by-sa/4.0/ |
dc.subject | gec |
dc.subject | grammatical error correction |
dc.subject | dataset |
dc.title | GECCC Grammar Error Correction Corpus for Czech (2022-09-28) |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
dc.rights.label | PUB |
has.files | yes |
branding | LINDAT / CLARIAH-CZ |
contact.person | Jakub Náplava arahusky@seznam.cz Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) |
sponsor | Grantová agentura České republiky GX20-16819X LUSyD – Language Understanding: from Syntax to Discourse nationalFunds |
sponsor | Ministerstvo školství, mládeže a tělovýchovy České republiky LM2018101 LINDAT/CLARIAH-CZ: Digitální výzkumná infrastruktura pro jazykové technologie, umění a humanitní vědy nationalFunds |
sponsor | Grantová agentura Univerzity Karlovy v Praze GAUK 578218 Automatická korekce jazyka pomocí neuronových sítí nationalFunds |
sponsor | Univerzita Karlova (mimo GAUK) SVV 260 575 Specifický vysokoškolský výzkum nationalFunds |
sponsor | Ministerstvo školství, mládeže a tělovýchovy České republiky CZ.02.1.01/0.0/0.0/16_019/0000734 Kreativita a adaptabilita jako předpoklad úspěchu Evropy v propojeném světě nationalFunds |
size.info | 83058 sentences |
size.info | 24 files |
files.size | 15534408 |
files.count | 1 |
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
- Name: geccc.zip
- Size: 14.81 MB
- Format: application/zip
- Description: corpus data and metadata, zipped
- MD5: f5ee8a933ffeea6358505787303ab187
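The MD5 value listed above can be used to check that a download of geccc.zip is intact. The following is a minimal sketch, not part of the record itself: the local file name `geccc.zip` in the current directory and the helper names `md5_of_file` and `verify_download` are assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# MD5 value copied from this record's file listing.
EXPECTED_MD5 = "f5ee8a933ffeea6358505787303ab187"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Compare the file's digest with the expected value, ignoring letter case."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("geccc.zip")  # assumed local download location
    if archive.exists():
        print("checksum OK" if verify_download(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

A mismatch usually indicates an incomplete or corrupted download. MD5 is suitable here as an integrity check only, not as a security guarantee.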
- data
- meta.tsv
- dev
  - paragraph.m2
  - paragraph.meta
  - sentence.meta
  - sentence.input
  - paragraph.gold
  - sentence.m2
  - sentence.gold
  - paragraph.input
- train
  - paragraph.m2
  - paragraph.meta
  - sentence.meta
  - sentence.input
  - paragraph.gold
  - sentence.m2
  - sentence.gold
  - paragraph.input
- test
  - paragraph.m2
  - paragraph.meta
  - sentence.meta
  - sentence.input
  - paragraph.gold
  - sentence.m2
  - sentence.gold
  - paragraph.input
- detokenizer.perl
- LICENSE
- README.md
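The `.m2` files in the listing hold the error annotations. As a rough sketch of how such a file might be read, the code below assumes the widely used M2 convention: an `S` line with the tokenized source sentence, followed by `A` lines of the form `start end|||type|||correction|||REQUIRED|||-NONE-|||annotator`. The exact layout of the GECCC files should be confirmed against the README.md in the archive. The names `Edit`, `Sentence`, `parse_m2` and `apply_edits` are hypothetical, as is the English sample sentence.

```python
from dataclasses import dataclass, field


@dataclass
class Edit:
    start: int        # token offset where the edit begins (-1 for a noop edit)
    end: int          # token offset where the edit ends (exclusive)
    error_type: str   # error category label
    correction: str   # replacement text; empty for a deletion
    annotator: int    # annotator id


@dataclass
class Sentence:
    tokens: list
    edits: list = field(default_factory=list)


def parse_m2(text):
    """Parse M2-formatted text into Sentence objects (assumes the standard layout)."""
    sentences = []
    current = None
    for line in text.splitlines():
        if line.startswith("S "):
            current = Sentence(tokens=line[2:].split())
            sentences.append(current)
        elif line.startswith("A ") and current is not None:
            fields = line[2:].split("|||")
            start, end = map(int, fields[0].split())
            current.edits.append(
                Edit(start, end, fields[1], fields[2], int(fields[-1]))
            )
    return sentences


def apply_edits(sentence, annotator=0):
    """Apply one annotator's edits right to left so earlier offsets stay valid."""
    tokens = list(sentence.tokens)
    chosen = [e for e in sentence.edits if e.annotator == annotator and e.start >= 0]
    for edit in sorted(chosen, key=lambda e: (e.start, e.end), reverse=True):
        tokens[edit.start:edit.end] = edit.correction.split()
    return tokens


# Illustrative sample only; it is not taken from the corpus.
SAMPLE = (
    "S This are a test .\n"
    "A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0\n"
)

if __name__ == "__main__":
    for parsed in parse_m2(SAMPLE):
        print(" ".join(apply_edits(parsed)))
```

Edits with a start offset of -1 (the conventional marker for "no change needed") are kept by `parse_m2` but skipped by `apply_edits`, so an unedited sentence is returned as-is.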