Automatically generated spelling correction corpus for Czech (Czech-SEC-AG)
Please use the following text to cite this item:
Hajič, Jan; Náplava, Jakub and Straka, Milan, 2017,
Automatically generated spelling correction corpus for Czech (Czech-SEC-AG), LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11234/1-2144.
Authors
Hajič, Jan; Náplava, Jakub; Straka, Milan
Item identifier
http://hdl.handle.net/11234/1-2144
Date issued
2017-04-30
Size
231688 words, 6 entries
Language(s)
Czech
Description
Automatically generated spelling correction corpus for Czech (Czech-SEC-AG) is a corpus containing text with automatically generated spelling errors. The errors are produced by a character error model that holds the probabilities of character substitution, insertion, and deletion, and of swapping two adjacent characters. The model also includes the probabilities of changing character casing. The original clean text on which the spelling errors were generated is PDT3.0 (http://hdl.handle.net/11858/00-097C-0000-0023-1AAF-3). The original train/dev/test sentence split of the PDT3.0 corpus is preserved in this dataset.
Besides the data with artificial spelling errors, we also publish the texts from which the character error model was created: the original manual transcript of the audiobook Švejk and its corrected version, prepared by the authors of Korektor (http://ufal.mff.cuni.cz/korektor). As in the CzeSL Grammatical Error Correction Dataset (CzeSL-GEC: http://hdl.handle.net/11234/1-2143), these data are processed into four sets based on the difficulty of the errors present.
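The error generation described above can be illustrated with a minimal Python sketch. It is not the published `make_errors.py`: the function name `add_errors`, the probability values, and the alphabet are hypothetical, and a single probability per operation is assumed here, whereas the published error model may condition its probabilities on the characters involved.

```python
import random

# Hypothetical per-operation probabilities, NOT taken from the published error model.
ERROR_PROBS = {"substitute": 0.01, "insert": 0.005, "delete": 0.005,
               "swap": 0.005, "case": 0.002}
# Assumed replacement alphabet (lowercase Czech letters).
ALPHABET = "aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž"


def add_errors(text, probs=ERROR_PROBS, alphabet=ALPHABET, rng=None):
    """Return a noisy copy of `text`, applying at most one operation per position."""
    if rng is None:
        rng = random.Random()
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        # Pick one operation (or none) by walking the cumulative probabilities.
        r = rng.random()
        threshold = 0.0
        op = None
        for name, p in probs.items():
            threshold += p
            if r < threshold:
                op = name
                break
        if op == "substitute" and ch.isalpha():
            out.append(rng.choice(alphabet))      # replace the character
        elif op == "insert":
            out.append(ch)
            out.append(rng.choice(alphabet))      # add a character after it
        elif op == "delete":
            pass                                  # drop the character
        elif op == "swap" and i + 1 < len(text):
            out.append(text[i + 1])               # swap with the next character
            out.append(ch)
            i += 1
        elif op == "case":
            out.append(ch.swapcase())             # change the casing
        else:
            out.append(ch)                        # keep the character unchanged
        i += 1
    return "".join(out)
```

Passing a seeded generator, for example `add_errors(sentence, rng=random.Random(42))`, makes the generated noise reproducible.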
Acknowledgement
Ministerstvo školství, mládeže a tělovýchovy České republiky
Project code: LM2015071
Project name: LINDAT/CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat
This item is Publicly Available
and licensed under:
Files in this item
- Name: 2017-czech-sec-ag.zip
- Size: 10.4 MB
- Format: application/zip
- Description: corpus data and metadata, scripts; zipped
- MD5: 7443b9d3255a51a3356edde05a479d0c
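The MD5 checksum listed above can be used to check a downloaded copy of the archive. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify_archive` are illustrative, and the local path to the archive is whatever the download was saved as.

```python
import hashlib

# Checksum as listed for 2017-czech-sec-ag.zip in this item.
EXPECTED_MD5 = "7443b9d3255a51a3356edde05a479d0c"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected
```

For example, `verify_archive("2017-czech-sec-ag.zip")` returns `True` only when the local file is byte-identical to the published one.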

- scripts
  - error_model_train0.desc (428 B)
  - error_model_train0.txt (601 kB)
  - make_errors.py (10 kB)
- svejk
  - svejk-word2word.text (61 kB)
  - svejk-word2word.gold (61 kB)
  - svejk-word2simword.text (61 kB)
  - svejk-word2simword.gold (61 kB)
  - svejk-word2words.text (62 kB)
  - svejk-word2words.gold (62 kB)
  - svejk-sent2sent.text (62 kB)
  - svejk-sent2sent.gold (62 kB)
- data
  - inputs_train.txt (9 MB)
  - targets_train.txt (9 MB)
  - inputs_dev.txt (1 MB)
  - targets_dev.txt (1 MB)
  - inputs_test.txt (1 MB)
  - targets_test.txt (1 MB)
- README.md (2 kB)
- LICENSE.txt (21 kB)
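The `inputs_*.txt` and `targets_*.txt` files in `data` can be paired up for training or evaluation. The sketch below assumes, without confirmation from this page, that the files are UTF-8 text with one sentence per line and that line N of an inputs file corresponds to line N of the matching targets file; the README.md in the archive should be consulted for the actual format. The helper name `read_pairs` is hypothetical.

```python
from pathlib import Path


def read_pairs(data_dir, split):
    """Read (noisy, clean) sentence pairs for one split: 'train', 'dev' or 'test'.

    Assumes line-aligned UTF-8 files named inputs_<split>.txt and targets_<split>.txt.
    """
    data_dir = Path(data_dir)
    inputs = (data_dir / f"inputs_{split}.txt").read_text(encoding="utf-8").splitlines()
    targets = (data_dir / f"targets_{split}.txt").read_text(encoding="utf-8").splitlines()
    if len(inputs) != len(targets):
        # A mismatch means the line-alignment assumption does not hold.
        raise ValueError(f"{split}: {len(inputs)} input lines vs {len(targets)} target lines")
    return list(zip(inputs, targets))
```

Calling `read_pairs("data", "dev")` would then return a list of `(input_sentence, target_sentence)` tuples for the development split.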

