GECCC Grammar Error Correction Corpus for Czech (2022-09-28)
Please use the following text to cite this item:
Náplava, Jakub; Straka, Milan; Straková, Jana and Rosen, Alexandr, 2022,
GECCC Grammar Error Correction Corpus for Czech (2022-09-28), LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11234/1-4861.
Authors
Náplava, Jakub; Straka, Milan; Straková, Jana; Rosen, Alexandr
Item identifier
http://hdl.handle.net/11234/1-4861
Date issued
2022-01-17
Size
83058 sentences, 24 files
Language(s)
Czech
Description
The Grammar Error Correction Corpus for Czech (GECCC) consists of 83 058 sentences and covers four diverse domains: essays written by native students, informal website texts, essays written by Romani ethnic minority children and teenagers, and essays written by nonnative speakers. All domains are professionally annotated for GEC errors in a unified manner, and the errors were automatically categorized with a Czech-specific version of ERRANT (https://github.com/ufal/errant_czech).
The dataset was introduced in the paper "Czech Grammar Error Correction with a Large and Diverse Corpus", which was accepted to TACL. Until it is published in TACL, see the arXiv version: https://arxiv.org/pdf/2201.05590.pdf
This version fixes double annotation errors in train and dev M2 files, and also contains more metadata information.
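The M2 files mentioned above can be read with a few lines of code. The following is a minimal sketch, assuming the files follow the common M2 layout used by ERRANT and the M2 scorer: an `S` line with space-separated tokens, followed by `A` lines of the form `start end|||type|||correction|||REQUIRED|||-NONE-|||annotator`. The sample sentence, the error-type labels, and the names `parse_m2` and `apply_edits` are illustrative and not taken from the corpus; the README.md in the archive remains the authoritative description of the format.

```python
from dataclasses import dataclass, field


@dataclass
class Edit:
    start: int          # token offset where the edit begins
    end: int            # token offset where the edit ends (exclusive)
    error_type: str     # error category label, e.g. as assigned by ERRANT
    correction: str     # replacement text; empty or "-NONE-" for deletions
    annotator: int      # annotator id (last field of the A line)


@dataclass
class M2Sentence:
    tokens: list
    edits: list = field(default_factory=list)


def parse_m2(text):
    """Parse M2-formatted text into a list of M2Sentence objects."""
    sentences = []
    current = None
    for line in text.splitlines():
        if line.startswith("S "):
            current = M2Sentence(tokens=line[2:].split(" "))
            sentences.append(current)
        elif line.startswith("A ") and current is not None:
            fields = line[2:].split("|||")
            start, end = (int(x) for x in fields[0].split())
            current.edits.append(
                Edit(start, end, fields[1], fields[2], int(fields[-1]))
            )
    return sentences


def apply_edits(sentence, annotator=0):
    """Return the corrected sentence for one annotator as a string."""
    tokens = list(sentence.tokens)
    edits = [
        e for e in sentence.edits
        if e.annotator == annotator and e.error_type != "noop" and e.start >= 0
    ]
    # Apply from right to left so earlier token offsets stay valid.
    for e in reversed(sorted(edits, key=lambda e: (e.start, e.end))):
        if e.correction and e.correction != "-NONE-":
            replacement = e.correction.split()
        else:
            replacement = []
        tokens[e.start:e.end] = replacement
    return " ".join(tokens)


if __name__ == "__main__":
    # Made-up English sample; labels are illustrative only.
    sample = (
        "S This are a a sentence .\n"
        "A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0\n"
        "A 3 4|||U:DET||||||REQUIRED|||-NONE-|||0\n"
    )
    for sent in parse_m2(sample):
        print(apply_edits(sent))
```

To process a real split, the same functions could be applied to the contents of, for example, `data/dev/sentence.m2` read with UTF-8 encoding; the path is an assumption based on the file listing.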
Acknowledgement
Grantová agentura České republiky (Czech Science Foundation)
Project code: GX20-16819X
Project name: LUSyD – Language Understanding: from Syntax to Discourse
Ministerstvo školství, mládeže a tělovýchovy České republiky (Ministry of Education, Youth and Sports of the Czech Republic)
Project code: LM2018101
Project name: LINDAT/CLARIAH-CZ: Digital Research Infrastructure for Language Technologies, Arts and Humanities
Grantová agentura Univerzity Karlovy v Praze (Charles University Grant Agency)
Project code: GAUK 578218
Project name: Automatic Language Correction Using Neural Networks
Univerzita Karlova, mimo GAUK (Charles University, outside GAUK)
Project code: SVV 260 575
Project name: Specific University Research
Ministerstvo školství, mládeže a tělovýchovy České republiky (Ministry of Education, Youth and Sports of the Czech Republic)
Project code: CZ.02.1.01/0.0/0.0/16_019/0000734
Project name: Creativity and Adaptability as Conditions for the Success of Europe in an Interrelated World
Subject(s)
Collections
This item is Publicly Available
and licensed under:
Files in this item
- Name: geccc.zip
- Size: 14.81 MB
- Format: application/zip
- Description: Zip
- MD5: f5ee8a933ffeea6358505787303ab187
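The MD5 value listed for geccc.zip can be used to check that a download is intact. The sketch below computes the digest with Python's standard `hashlib` module and compares it to that value. The local path `geccc.zip` and the helper names `md5_of_file` and `verify_download` are assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# MD5 value listed for geccc.zip on this page.
EXPECTED_MD5 = "f5ee8a933ffeea6358505787303ab187"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("geccc.zip")  # assumed location of the downloaded archive
    if archive.exists():
        print("checksum OK" if verify_download(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

Reading in chunks keeps memory use constant regardless of archive size, which matters little for a 14.81 MB file but makes the helper reusable for larger downloads.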

- data
  - meta.tsv (950 kB)
  - dev
    - paragraph.m2 (1 MB)
    - paragraph.meta (81 kB)
    - sentence.meta (191 kB)
    - sentence.input (523 kB)
    - paragraph.gold (817 kB)
    - sentence.m2 (2 MB)
    - sentence.gold (823 kB)
    - paragraph.input (518 kB)
  - train
    - paragraph.m2 (12 MB)
    - paragraph.meta (460 kB)
    - sentence.meta (1 MB)
    - sentence.input (3 MB)
    - paragraph.gold (4 MB)
    - sentence.m2 (13 MB)
    - sentence.gold (4 MB)
    - paragraph.input (3 MB)
  - test
    - paragraph.m2 (2 MB)
    - paragraph.meta (70 kB)
    - sentence.meta (170 kB)
    - sentence.input (506 kB)
    - paragraph.gold (1009 kB)
    - sentence.m2 (2 MB)
    - sentence.gold (1009 kB)
    - paragraph.input (507 kB)
- detokenizer.perl (12 kB)
- LICENSE (19 kB)
- README.md (3 kB)

