Automatically generated spelling correction corpus for Czech (Czech-SEC-AG)

Name: Automatically generated spelling correction corpus for Czech (Czech-SEC-AG)
License: http://creativecommons.org/licenses/by-nc-sa/3.0/

Hajič, Jan; Náplava, Jakub; Straka, Milan

dc.contributor.author	Hajič, Jan
dc.contributor.author	Náplava, Jakub
dc.contributor.author	Straka, Milan
dc.date.accessioned	2017-05-03T08:09:00Z
dc.date.available	2017-05-03T08:09:00Z
dc.date.issued	2017-04-30
dc.identifier.uri	http://hdl.handle.net/11234/1-2144
dc.description	Automatically generated spelling correction corpus for Czech (Czech-SEC-AG) is a corpus containing text with automatically generated spelling errors. The errors are created with a character error model that contains probabilities of character substitution, insertion, and deletion, and of swapping two adjacent characters. Probabilities of changing character casing are also considered. The original clean text on which the spelling errors were generated is PDT3.0 (http://hdl.handle.net/11858/00-097C-0000-0023-1AAF-3). The original train/dev/test sentence split of the PDT3.0 corpus is preserved in this dataset. Besides the data with artificial spelling errors, we also publish the texts from which the character error model was created: the original manual transcript of the audiobook Švejk and its corrected version produced by the authors of Korektor (http://ufal.mff.cuni.cz/korektor). Like the CzeSL Grammatical Error Correction Dataset (CzeSL-GEC: http://hdl.handle.net/11234/1-2143), these data are processed into four sets based on the difficulty of the errors present.
dc.language.iso	ces
dc.publisher	Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.rights	Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
dc.rights.uri	http://creativecommons.org/licenses/by-nc-sa/3.0/
dc.subject	spelling correction
dc.subject	natural language correction
dc.title	Automatically generated spelling correction corpus for Czech (Czech-SEC-AG)
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
dc.rights.label	PUB
has.files	yes
branding	LINDAT / CLARIAH-CZ
contact.person	Jakub Náplava naplava@ufal.mff.cuni.cz Charles University, UFAL
contact.person	Milan Straka straka@ufal.mff.cuni.cz Charles University, UFAL
sponsor	Ministerstvo školství, mládeže a tělovýchovy České republiky LM2015071 LINDAT/CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat nationalFunds
size.info	231688 words
size.info	6 entries
files.size	10906515
files.count	1
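
The error generation summarized in the description can be illustrated with a minimal sketch. This is not the make_errors.py script shipped in the archive; the function names, the alphabet, and the use of one global probability per operation are assumptions made for illustration only, and the real error model file is not reproduced here.

```python
import random

# Hypothetical alphabet used when inserting or substituting characters.
ALPHABET = "aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž"


def pick_operation(probs, rng):
    """Return one operation name according to its probability, or 'keep'."""
    r = rng.random()
    cumulative = 0.0
    for operation, probability in probs.items():
        cumulative += probability
        if r < cumulative:
            return operation
    return "keep"


def add_errors(text, probs, alphabet=ALPHABET, rng=None):
    """Corrupt text character by character (illustrative sketch only)."""
    rng = rng or random.Random()
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        operation = pick_operation(probs, rng)
        if operation == "delete":
            pass  # drop the character
        elif operation == "insert":
            out.append(rng.choice(alphabet))  # extra character before ch
            out.append(ch)
        elif operation == "substitute":
            out.append(rng.choice(alphabet))  # replace ch
        elif operation == "swap" and i + 1 < len(text):
            out.append(text[i + 1])  # exchange ch with its right neighbour
            out.append(ch)
            i += 1  # the neighbour has been consumed as well
        elif operation == "case":
            out.append(ch.swapcase())  # change character casing
        else:
            out.append(ch)  # keep the character unchanged
        i += 1
    return "".join(out)


if __name__ == "__main__":
    # Hypothetical probabilities, chosen only to make errors visible.
    probs = {"substitute": 0.02, "insert": 0.02, "delete": 0.02,
             "swap": 0.02, "case": 0.02}
    print(add_errors("Dobrý voják Švejk", probs, rng=random.Random(0)))
```

A model estimated from real data would presumably condition these probabilities on the characters involved; the sketch draws replacements uniformly from the alphabet to stay short.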

Files in this item

This item is publicly available and licensed under:
Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)

Name: 2017-czech-sec-ag.zip
Size: 10.4 MB
Format: application/zip
Description: corpus data and metadata, scripts; zipped
MD5: 7443b9d3255a51a3356edde05a479d0c
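
The MD5 value listed above can be used to check that a download is intact. The following is a minimal sketch using Python's standard hashlib module; the helper names and the local path in the usage line are assumptions, not part of the record.

```python
import hashlib

# Checksum as listed in this record for 2017-czech-sec-ag.zip.
EXPECTED_MD5 = "7443b9d3255a51a3356edde05a479d0c"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `is_intact("2017-czech-sec-ag.zip")` would be called on the downloaded archive, assuming it was saved under that name in the current directory.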

File Preview

scripts
- error_model_train0.desc (428 B)
- error_model_train0.txt (601 kB)
- make_errors.py (10 kB)
svejk
- svejk-word2word.text (61 kB)
- svejk-word2word.gold (61 kB)
- svejk-word2simword.text (61 kB)
- svejk-word2simword.gold (61 kB)
- svejk-word2words.text (62 kB)
- svejk-sent2sent.text (62 kB)
- svejk-word2words.gold (62 kB)
- svejk-sent2sent.gold (62 kB)
data
- inputs_test.txt (1 MB)
- targets_dev.txt (1 MB)
- targets_test.txt (1 MB)
- inputs_dev.txt (1 MB)
- inputs_train.txt (9 MB)
- targets_train.txt (9 MB)
- README.md (2 kB)
- LICENSE.txt (21 kB)
