Please use the following text to cite this item:
Náplava, Jakub; Straka, Milan; Straková, Jana and Rosen, Alexandr, 2022, GECCC Grammar Error Correction Corpus for Czech (2022-09-28), LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11234/1-4861.
dc.contributor.author: Náplava, Jakub
dc.contributor.author: Straka, Milan
dc.contributor.author: Straková, Jana
dc.contributor.author: Rosen, Alexandr
dc.date.accessioned: 2022-10-04T12:59:13Z
dc.date.available: 2022-10-04T12:59:13Z
dc.date.issued: 2022-01-17
dc.description: Grammar Error Correction Corpus for Czech (GECCC) consists of 83 058 sentences and covers four diverse domains: essays written by native students, informal website texts, essays written by Romani ethnic minority children and teenagers, and essays written by nonnative speakers. All domains are professionally annotated for GEC errors in a unified manner, and errors were automatically categorized with a Czech-specific version of ERRANT released at https://github.com/ufal/errant_czech. The dataset was introduced in the paper "Czech Grammar Error Correction with a Large and Diverse Corpus", which was accepted to TACL. Until published in TACL, see the arXiv version: https://arxiv.org/pdf/2201.05590.pdf. This version fixes double annotation errors in the train and dev M2 files and also contains more metadata information.
dc.identifier.uri: http://hdl.handle.net/11234/1-4861
dc.language.iso: ces
dc.publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.relation.isreferencedby: https://arxiv.org/pdf/2201.05590.pdf
dc.relation.isreferencedby: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00470/110536
dc.relation.replaces: http://hdl.handle.net/11234/1-4639
dc.rights: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.label: PUB
dc.rights.uri: http://creativecommons.org/licenses/by-sa/4.0/
dc.subject: gec
dc.subject: grammatical error correction
dc.subject: dataset
dc.title: GECCC Grammar Error Correction Corpus for Czech (2022-09-28)
dc.type: corpus
local.branding: LINDAT / CLARIAH-CZ
local.contact.person: Jakub Náplava arahusky@seznam.cz Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
local.files.count: 1
local.files.size: 15534408
local.has.files: yes
local.language.name: Czech
local.size.info: 83058 sentences
local.size.info: 24 files
local.sponsor: nationalFunds GX20-16819X Grantová agentura České republiky LUSyD – Language Understanding: from Syntax to Discourse
local.sponsor: nationalFunds LM2018101 Ministerstvo školství, mládeže a tělovýchovy České republiky LINDAT/CLARIAH-CZ: Digitální výzkumná infrastruktura pro jazykové technologie, umění a humanitní vědy
local.sponsor: nationalFunds GAUK 578218 Grantová agentura Univerzity Karlovy v Praze Automatická korekce jazyka pomocí neuronových sítí
local.sponsor: nationalFunds SVV 260 575 Univerzita Karlova (mimo GAUK) Specifický vysokoškolský výzkum
local.sponsor: nationalFunds CZ.02.1.01/0.0/0.0/16_019/0000734 Ministerstvo školství, mládeže a tělovýchovy České republiky Kreativita a adaptabilita jako předpoklad úspěchu Evropy v propojeném světě
metashare.ResourceInfo#ContentInfo.mediaType: text
 Files in this item
Name: geccc.zip
Size: 14.81 MB
Format: application/zip
Description: corpus data and metadata, zipped
MD5: f5ee8a933ffeea6358505787303ab187
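
The published MD5 and the byte size listed in this record (15534408) can be used to check that a downloaded copy of geccc.zip is intact. The following is a minimal sketch in Python, not part of the corpus distribution; the function names `md5_of_file` and `verify_download` and the local path passed to them are hypothetical.

```python
import hashlib
from pathlib import Path

# Values copied from this record (MD5 field and local.files.size).
EXPECTED_MD5 = "f5ee8a933ffeea6358505787303ab187"
EXPECTED_SIZE = 15534408  # bytes


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5=EXPECTED_MD5, expected_size=EXPECTED_SIZE):
    """Return True if the file at `path` matches the expected size and MD5."""
    path = Path(path)
    if path.stat().st_size != expected_size:
        return False
    return md5_of_file(path) == expected_md5
```

Calling `verify_download("geccc.zip")` on a downloaded copy (the path is an assumption) returns `True` only when both the size and the checksum match the values above.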
  File Preview
  • data
    • meta.tsv (950 kB)
    • dev
      • paragraph.m2 (1 MB)
      • paragraph.meta (81 kB)
      • sentence.meta (191 kB)
      • sentence.input (523 kB)
      • paragraph.gold (817 kB)
      • sentence.m2 (2 MB)
      • sentence.gold (823 kB)
      • paragraph.input (518 kB)
    • train
      • paragraph.m2 (12 MB)
      • paragraph.meta (460 kB)
      • sentence.meta (1 MB)
      • sentence.input (3 MB)
      • paragraph.gold (4 MB)
      • sentence.m2 (13 MB)
      • sentence.gold (4 MB)
      • paragraph.input (3 MB)
    • test
      • paragraph.m2 (2 MB)
      • paragraph.meta (70 kB)
      • sentence.meta (170 kB)
      • sentence.input (506 kB)
      • paragraph.gold (1009 kB)
      • sentence.m2 (2 MB)
      • sentence.gold (1009 kB)
      • paragraph.input (507 kB)
    • detokenizer.perl (12 kB)
    • LICENSE (19 kB)
    • README.md (3 kB)
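
The `.m2` files in the listing above can be read with a few lines of Python. The sketch below assumes they follow the conventional M2 layout used by ERRANT-style tools: an `S` line holding the tokenized source sentence, followed by `A` lines of the form `start end|||type|||correction|||REQUIRED|||-NONE-|||annotator`. That layout is an assumption here; the README.md shipped in the archive is the authoritative description. The names `Edit`, `Sentence`, `parse_m2` and `apply_edits` are hypothetical, and the English sample sentence and error labels in the comments are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Edit:
    start: int        # index of the first source token covered by the edit
    end: int          # index one past the last covered token
    error_type: str   # error category label
    correction: str   # replacement text; empty for a deletion
    annotator: int    # annotator id from the last field


@dataclass
class Sentence:
    tokens: list
    edits: list = field(default_factory=list)


def parse_m2(lines):
    """Parse an iterable of M2 lines into a list of Sentence objects."""
    sentences = []
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("S "):
            current = Sentence(tokens=line[2:].split())
            sentences.append(current)
        elif line.startswith("A ") and current is not None:
            fields = line[2:].split("|||")
            start, end = (int(x) for x in fields[0].split())
            correction = "" if fields[2] == "-NONE-" else fields[2]
            current.edits.append(
                Edit(start, end, fields[1], correction, int(fields[-1]))
            )
    return sentences


def apply_edits(sentence, annotator=0):
    """Apply one annotator's edits left to right and return the corrected text."""
    # Edits with a negative span (the conventional "noop" marker) are skipped.
    edits = [e for e in sentence.edits if e.annotator == annotator and e.start >= 0]
    output = []
    position = 0
    for edit in sorted(edits, key=lambda e: (e.start, e.end)):
        output.extend(sentence.tokens[position:edit.start])
        output.extend(edit.correction.split())
        position = max(position, edit.end)
    output.extend(sentence.tokens[position:])
    return " ".join(output)
```

Passing an open file handle (for example one opened on `data/dev/sentence.m2` with UTF-8 encoding, a path inferred from the listing) to `parse_m2` yields one `Sentence` per `S` line; `apply_edits` then reconstructs the corrected, still tokenized, sentence for a chosen annotator.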