
 
dc.contributor.author Hajič, Jan
dc.contributor.author Náplava, Jakub
dc.contributor.author Straka, Milan
dc.date.accessioned 2017-05-03T08:09:00Z
dc.date.available 2017-05-03T08:09:00Z
dc.date.issued 2017-04-30
dc.identifier.uri http://hdl.handle.net/11234/1-2144
dc.description The Automatically generated spelling correction corpus for Czech (Czech-SEC-AG) is a corpus containing text with automatically generated spelling errors. The errors are created with a character error model that holds the probabilities of character substitution, insertion, and deletion, and of swapping two adjacent characters. The probabilities of changing character casing are also considered. The original clean text into which the spelling errors were introduced is PDT3.0 (http://hdl.handle.net/11858/00-097C-0000-0023-1AAF-3). The original train/dev/test sentence split of the PDT3.0 corpus is preserved in this dataset. Besides the data with artificial spelling errors, we also publish the texts from which the character error model was created: the original manual transcript of the audiobook Švejk and its corrected version prepared by the authors of Korektor (http://ufal.mff.cuni.cz/korektor). Like the CzeSL Grammatical Error Correction Dataset (CzeSL-GEC: http://hdl.handle.net/11234/1-2143), these data are processed into four sets based on the difficulty of the errors present.
dc.language.iso ces
dc.publisher Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.rights Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
dc.rights.uri http://creativecommons.org/licenses/by-nc-sa/3.0/
dc.subject spelling correction
dc.subject natural language correction
dc.title Automatically generated spelling correction corpus for Czech (Czech-SEC-AG)
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
dc.rights.label PUB
has.files yes
branding LINDAT / CLARIAH-CZ
contact.person Jakub Náplava naplava@ufal.mff.cuni.cz Charles University, UFAL
contact.person Milan Straka straka@ufal.mff.cuni.cz Charles University, UFAL
sponsor Ministry of Education, Youth and Sports of the Czech Republic LM2015071 LINDAT/CLARIN: Institute for Analysis, Processing and Distribution of Linguistic Data nationalFunds
size.info 231688 words
size.info 6 entries
files.size 10906515
files.count 1
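
The description above outlines a character error model with five kinds of edits: substitution, insertion, deletion, swapping of adjacent characters, and case changes. The following is a minimal sketch of how such a model could be applied to clean text. It is not the `make_errors.py` script shipped in the archive; the function names, the flat per-character probabilities, and the alphabet are all assumptions made for illustration, and the real probabilities live in `error_model_train0.txt`, whose format is not shown in this record.

```python
import random

# Order in which the (hypothetical) edit operations are considered.
OPS = ("substitute", "insert", "delete", "swap", "case")


def pick_op(probs, rng):
    """Pick at most one edit operation using cumulative probabilities."""
    r = rng.random()
    total = 0.0
    for op in OPS:
        total += probs.get(op, 0.0)
        if r < total:
            return op
    return None  # leave the character unchanged


def add_errors(text, probs, alphabet, rng=None):
    """Return a noisy copy of `text`.

    `probs` maps operation names to illustrative per-character
    probabilities; `alphabet` supplies replacement/inserted characters.
    Both are assumptions, not values taken from the published model.
    """
    rng = rng or random.Random()
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        op = pick_op(probs, rng)
        if op == "swap" and i + 1 < len(text):
            # Swap the current character with its right neighbour.
            out.append(text[i + 1])
            out.append(ch)
            i += 2
            continue
        if op == "substitute":
            out.append(rng.choice(alphabet))
        elif op == "insert":
            out.append(rng.choice(alphabet))
            out.append(ch)
        elif op == "delete":
            pass  # drop the character
        elif op == "case":
            out.append(ch.swapcase())
        else:
            # No edit, or a swap requested on the last character.
            out.append(ch)
        i += 1
    return "".join(out)


if __name__ == "__main__":
    demo_probs = {"substitute": 0.03, "insert": 0.02, "delete": 0.02,
                  "swap": 0.02, "case": 0.01}
    czech_letters = "aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž"
    print(add_errors("Osudy dobrého vojáka Švejka", demo_probs,
                     czech_letters, random.Random(0)))
```

A model estimated from real data would typically condition these probabilities on the specific characters involved (for example, how often `í` is typed as `i`) rather than use one flat value per operation as this sketch does.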


 Files in this item

This item is Publicly Available and licensed under:
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
Name 2017-czech-sec-ag.zip
Size 10.4 MB
Format application/zip
Description corpus data and metadata, scripts; zipped
MD5 7443b9d3255a51a3356edde05a479d0c

 File Preview
  • scripts
    • error_model_train0.desc (428 B)
    • error_model_train0.txt (601 kB)
    • make_errors.py (10 kB)
  • svejk
    • svejk-word2word.text (61 kB)
    • svejk-word2word.gold (61 kB)
    • svejk-word2simword.text (61 kB)
    • svejk-word2simword.gold (61 kB)
    • svejk-word2words.text (62 kB)
    • svejk-sent2sent.text (62 kB)
    • svejk-word2words.gold (62 kB)
    • svejk-sent2sent.gold (62 kB)
  • data
    • inputs_test.txt (1 MB)
    • targets_dev.txt (1 MB)
    • targets_test.txt (1 MB)
    • inputs_dev.txt (1 MB)
    • inputs_train.txt (9 MB)
    • targets_train.txt (9 MB)
    • README.md (2 kB)
    • LICENSE.txt (21 kB)
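
The MD5 checksum listed for the archive can be used to check that a download is intact. The following is a small sketch using only Python's standard `hashlib` module; the helper names `md5_of_file` and `verify` are illustrative, and the local path passed to them is an assumption about where the archive was saved.

```python
import hashlib

# MD5 value as published in this record for 2017-czech-sec-ag.zip.
EXPECTED_MD5 = "7443b9d3255a51a3356edde05a479d0c"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify("2017-czech-sec-ag.zip")` should return `True` for an uncorrupted copy saved under that name in the current directory.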
