dc.contributor.author Náplava, Jakub
dc.contributor.author Straka, Milan
dc.contributor.author Straková, Jana
dc.contributor.author Rosen, Alexandr
dc.date.accessioned 2022-10-04T12:59:13Z
dc.date.available 2022-10-04T12:59:13Z
dc.date.issued 2022-01-17
dc.identifier.uri http://hdl.handle.net/11234/1-4861
dc.description Grammar Error Correction Corpus for Czech (GECCC) consists of 83,058 sentences and covers four diverse domains: essays written by native students, informal website texts, essays written by Romani ethnic minority children and teenagers, and essays written by nonnative speakers. All domains are professionally annotated for GEC errors in a unified manner, and the errors were automatically categorized with a Czech-specific version of ERRANT, released at https://github.com/ufal/errant_czech . The dataset was introduced in the paper "Czech Grammar Error Correction with a Large and Diverse Corpus", which was accepted to TACL. Until it is published in TACL, see the arXiv version: https://arxiv.org/pdf/2201.05590.pdf . This version fixes double-annotation errors in the train and dev M2 files and contains more metadata.
dc.language.iso ces
dc.publisher Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.relation.isreferencedby https://arxiv.org/pdf/2201.05590.pdf
dc.relation.isreferencedby https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00470/110536
dc.relation.replaces http://hdl.handle.net/11234/1-4639
dc.rights Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri http://creativecommons.org/licenses/by-sa/4.0/
dc.subject gec
dc.subject grammatical error correction
dc.subject dataset
dc.title GECCC Grammar Error Correction Corpus for Czech (2022-09-28)
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
dc.rights.label PUB
has.files yes
branding LINDAT / CLARIAH-CZ
contact.person Jakub Náplava arahusky@seznam.cz Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
sponsor Grantová agentura České republiky GX20-16819X LUSyD – Language Understanding: from Syntax to Discourse nationalFunds
sponsor Ministerstvo školství, mládeže a tělovýchovy České republiky LM2018101 LINDAT/CLARIAH-CZ: Digitální výzkumná infrastruktura pro jazykové technologie, umění a humanitní vědy nationalFunds
sponsor Grantová agentura Univerzity Karlovy v Praze GAUK 578218 Automatická korekce jazyka pomocí neuronových sítí nationalFunds
sponsor Univerzita Karlova (mimo GAUK) SVV 260 575 Specifický vysokoškolský výzkum nationalFunds
sponsor Ministerstvo školství, mládeže a tělovýchovy České republiky CZ.02.1.01/0.0/0.0/16_019/0000734 Kreativita a adaptabilita jako předpoklad úspěchu Evropy v propojeném světě nationalFunds
size.info 83058 sentences
size.info 24 files
files.size 15534408
files.count 1


 Files in this item

License category:
Publicly Available

License: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Name: geccc.zip
Size: 14.81 MB
Format: application/zip
Description: corpus data and metadata, zipped
MD5: f5ee8a933ffeea6358505787303ab187
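
The MD5 value listed for geccc.zip can be used to check a downloaded copy. The following is a minimal sketch, not part of the record itself: the local path `geccc.zip` and the helper names `md5_of_file` and `matches_record` are assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# MD5 checksum copied from this record.
EXPECTED_MD5 = "f5ee8a933ffeea6358505787303ab187"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected_md5=EXPECTED_MD5):
    """True if the file's MD5 equals the expected value (case-insensitive)."""
    return md5_of_file(path) == expected_md5.lower()


# Hypothetical download location; adjust to wherever the archive was saved.
archive = Path("geccc.zip")
if archive.exists():
    if matches_record(archive):
        print("geccc.zip matches the MD5 in the record")
    else:
        print("geccc.zip does NOT match the MD5 in the record")
```

MD5 is suitable here only as an integrity check against accidental corruption during download; it is not a defense against deliberate tampering.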
 File preview
  • data
    • meta.tsv
    • dev
      • paragraph.m2
      • paragraph.meta
      • sentence.meta
      • sentence.input
      • paragraph.gold
      • sentence.m2
      • sentence.gold
      • paragraph.input
    • train
      • paragraph.m2
      • paragraph.meta
      • sentence.meta
      • sentence.input
      • paragraph.gold
      • sentence.m2
      • sentence.gold
      • paragraph.input
    • test
      • paragraph.m2
      • paragraph.meta
      • sentence.meta
      • sentence.input
      • paragraph.gold
      • sentence.m2
      • sentence.gold
      • paragraph.input
    • detokenizer.perl
    • LICENSE
    • README.md
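
The `*.m2` files in each split hold the error annotations. The sketch below shows one way to read such a file, assuming it follows the M2 layout commonly used in GEC (an `S` line with the tokenized source sentence, followed by `A` lines of the form `start end|||type|||correction|||REQUIRED|||-NONE-|||annotator`, with blank lines between sentences). That layout is an assumption here; the README.md in the archive is the authoritative description. The class and function names, and the English sample lines, are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Edit:
    start: int        # token offset where the edit begins (-1 for "noop")
    end: int          # token offset where the edit ends (exclusive)
    error_type: str   # error category label
    correction: str   # replacement text
    annotator: int    # annotator id (last field of the A line)


@dataclass
class Sentence:
    tokens: list
    edits: list = field(default_factory=list)


def parse_m2(lines):
    """Parse an iterable of M2-formatted lines into Sentence objects."""
    sentences = []
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("S "):
            current = Sentence(tokens=line[2:].split())
            sentences.append(current)
        elif line.startswith("A ") and current is not None:
            fields = line[2:].split("|||")
            start, end = (int(value) for value in fields[0].split())
            current.edits.append(
                Edit(start, end, fields[1], fields[2], int(fields[-1]))
            )
    return sentences


# Made-up sample in the assumed layout, not taken from the corpus.
sample = [
    "S This are a test .",
    "A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0",
    "",
    "S Fine .",
    "A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||0",
]

for sentence in parse_m2(sample):
    print(" ".join(sentence.tokens), [e.error_type for e in sentence.edits])
```

To read a real file, the same function can be given an open file handle, for example `parse_m2(open("data/dev/sentence.m2", encoding="utf-8"))`, where the path follows the file listing above.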
