Corpus for training and evaluating diacritics restoration systems

Náplava, Jakub; Straka, Milan; Hajič, Jan; Straňák, Pavel

dc.contributor.author	Náplava, Jakub
dc.contributor.author	Straka, Milan
dc.contributor.author	Hajič, Jan
dc.contributor.author	Straňák, Pavel
dc.date.accessioned	2018-03-05T14:37:18Z
dc.date.available	2018-03-05T14:37:18Z
dc.date.issued	2018-01-31
dc.identifier.uri	http://hdl.handle.net/11234/1-2607
dc.description	A corpus of texts in 12 languages. For each language, we provide one training, one development, and one test set drawn from Wikipedia articles. In addition, each language dataset contains a substantially larger training set collected from general Web texts. All sets are disjoint, except that the Wikipedia and Web training sets may contain similar sentences. The data are segmented into sentences, which are then tokenized into words. All data in the corpus contain diacritics. To strip them, use the Python script diacritization_stripping.py in the attached stripping_diacritics.zip. The script has two modes; we generally recommend the method called uninames, which behaves better for some languages. The code for training a recurrent-neural-network-based model for diacritics restoration is at https://github.com/arahusky/diacritics_restoration.
dc.language.iso	ces
dc.language.iso	vie
dc.language.iso	ron
dc.language.iso	pol
dc.language.iso	slk
dc.language.iso	spa
dc.language.iso	hrv
dc.language.iso	gle
dc.language.iso	lav
dc.language.iso	hun
dc.language.iso	fra
dc.language.iso	tur
dc.publisher	Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.rights	Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri	http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.subject	diacritical marks generation
dc.subject	natural language correction
dc.title	Corpus for training and evaluating diacritics restoration systems
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
dc.rights.label	PUB
has.files	yes
branding	LINDAT / CLARIAH-CZ
contact.person	Jakub Náplava naplava@ufal.mff.cuni.cz Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
sponsor	Ministerstvo školství, mládeže a tělovýchovy České republiky LM2015071 LINDAT/CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat nationalFunds
size.info	48 entries
files.size	18647731256
files.count	13
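
The description recommends a stripping mode called uninames but does not say how it works. The sketch below is only one plausible reading of a Unicode-name-based approach, not the attached diacritization_stripping.py: the function names strip_by_unicode_name and strip_by_decomposition are hypothetical, and the actual script may behave differently. It is contrasted with plain NFD decomposition to suggest why a name-based method could behave better for some languages.

```python
import unicodedata


def strip_by_unicode_name(text: str) -> str:
    """Replace each character named 'X WITH ...' by the character named 'X'.

    Hypothetical illustration; not the script shipped with the corpus.
    """
    result = []
    for ch in text:
        name = unicodedata.name(ch, "")  # "" for characters without a name
        base_name, sep, _ = name.partition(" WITH ")
        if sep:
            try:
                ch = unicodedata.lookup(base_name)
            except KeyError:
                pass  # no character carries the base name; keep the original
        result.append(ch)
    return "".join(result)


def strip_by_decomposition(text: str) -> str:
    """Decompose to NFD and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


if __name__ == "__main__":
    for word in ["Straňák", "Łódź"]:
        print(word, strip_by_unicode_name(word), strip_by_decomposition(word))
```

Letters such as Ł have no canonical decomposition, so the NFD variant leaves them unchanged, while the name-based variant maps them to their base letter. The name-based variant, in turn, assumes precomposed (NFC) input and does not remove stand-alone combining marks.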

Files in this item

License category:

Publicly Available

License: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

Name: cs.zip
Size: 2.12 GB
Format: application/zip
Description: Unknown
MD5: 6b03e4cbceaf597bbbb699c7043bd556

File preview

cs
- target_train.txt.xz (28 MB)
- target_dev.txt.xz (537 kB)
- statmt_2017_17_train_target_sentences.txt.xz (2 GB)
- target_test.txt.xz (1 MB)

Name: es.zip
Size: 3.18 GB
Format: application/zip
Description: Unknown
MD5: f4414bbe3f9128553ebb589e3148f9ae

File preview

es
- target_train.txt.xz (61 MB)
- target_dev.txt.xz (645 kB)
- statmt_2017_17_train_target_sentences.txt.xz (3 GB)
- target_test.txt.xz (1 MB)

Name: fr.zip
Size: 3.17 GB
Format: application/zip
Description: Unknown
MD5: ee53f56e4b9907d6f7285460147dce58

File preview

fr
- target_train.txt.xz (59 MB)
- target_dev.txt.xz (626 kB)
- statmt_2017_17_train_target_sentences.txt.xz (3 GB)
- target_test.txt.xz (1 MB)

Name: ga.zip
Size: 10.62 MB
Format: application/zip
Description: Unknown
MD5: 02f92d01d9e8fe839ebdc3a11d923ada

File preview

ga
- target_train.txt.xz (1 MB)
- target_dev.txt.xz (507 kB)
- statmt_2017_17_train_target_sentences.txt.xz (7 MB)
- target_test.txt.xz (966 kB)

Name: hr.zip
Size: 327.78 MB
Format: application/zip
Description: Unknown
MD5: 2ab2f0a4ffe91ec9afd007817c3f003b

File preview

hr
- target_train.txt.xz (24 MB)
- target_dev.txt.xz (569 kB)
- statmt_2017_17_train_target_sentences.txt.xz (302 MB)
- target_test.txt.xz (1 MB)

Name: hu.zip
Size: 1.81 GB
Format: application/zip
Description: Unknown
MD5: baa30350736a10639ea6b584ba1632ea

File preview

hu
- target_train.txt.xz (36 MB)
- target_dev.txt.xz (570 kB)
- statmt_2017_17_train_target_sentences.txt.xz (1 GB)
- target_test.txt.xz (1 MB)

Name: lv.zip
Size: 148.56 MB
Format: application/zip
Description: Unknown
MD5: 48880586dc8e16285d6f2c7cf121fe5b

File preview

lv
- target_train.txt.xz (8 MB)
- target_dev.txt.xz (498 kB)
- statmt_2017_17_train_target_sentences.txt.xz (138 MB)
- target_test.txt.xz (952 kB)

Name: pl.zip
Size: 1.47 GB
Format: application/zip
Description: Unknown
MD5: 3af730ee2899c7bcb54fc6a50e2c0d1e

File preview

pl
- target_train.txt.xz (32 MB)
- target_dev.txt.xz (584 kB)
- statmt_2017_17_train_target_sentences.txt.xz (1 GB)
- target_test.txt.xz (1 MB)

Name: ro.zip
Size: 718.05 MB
Format: application/zip
Description: Unknown
MD5: a1d886a46f25c3b59404c6d15fba862d

File preview

ro
- target_train.txt.xz (26 MB)
- target_dev.txt.xz (623 kB)
- statmt_2017_17_train_target_sentences.txt.xz (689 MB)
- target_test.txt.xz (1 MB)

Name: sk.zip
Size: 520.57 MB
Format: application/zip
Description: Unknown
MD5: 6e925c378a93b736bd40ca9324fadb21

File preview

sk
- target_train.txt.xz (17 MB)
- target_dev.txt.xz (564 kB)
- statmt_2017_17_train_target_sentences.txt.xz (501 MB)
- target_test.txt.xz (1 MB)

Name: tr.zip
Size: 2.88 GB
Format: application/zip
Description: Unknown
MD5: 9a6169d73187f8112f8eef75c6e40dc4

File preview

tr
- target_train.txt.xz (21 MB)
- target_dev.txt.xz (515 kB)
- statmt_2017_17_train_target_sentences.txt.xz (2 GB)
- target_test.txt.xz (909 kB)

Name: vi.zip
Size: 1.05 GB
Format: application/zip
Description: Unknown
MD5: 987b9e324056d52bd27ce3bfb3a0ebe2

File preview

vi
- target_train.txt.xz (23 MB)
- target_dev.txt.xz (587 kB)
- statmt_2017_17_train_target_sentences.txt.xz (1 GB)
- target_test.txt.xz (1 MB)
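
Each language archive above is a .zip whose members are .xz-compressed text files, so the data can be streamed without unpacking everything to disk. The sketch below rests on assumptions the record does not confirm: that member paths follow the preview (for example cs/target_dev.txt.xz), that the files are UTF-8, and that each line holds one sentence with tokens separated by whitespace. The function name iter_sentences is hypothetical.

```python
import lzma
import zipfile
from typing import Iterator, List


def iter_sentences(zip_path: str, member: str) -> Iterator[List[str]]:
    """Yield each non-empty line of an .xz member inside a .zip as a token list.

    Assumes UTF-8 text, one sentence per line, whitespace-separated tokens.
    """
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as compressed:
            # lzma.open accepts an already open binary file object.
            with lzma.open(compressed, "rt", encoding="utf-8") as lines:
                for line in lines:
                    tokens = line.split()
                    if tokens:
                        yield tokens
```

A call such as iter_sentences("cs.zip", "cs/target_dev.txt.xz") would then yield one token list per sentence, provided the member path matches the archive's real layout.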

Name: stripping_diacritics.zip
Size: 15.91 KB
Format: application/zip
Description: script for stripping diacritics
MD5: 0a3f98a7a17534acaae8aaac461cd9fa

File preview

- diacritization_stripping_data.py (55 kB)
- diacritization_stripping.py (1 kB)
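
Every file in this record is listed with an MD5 checksum, which can be used to confirm that a multi-gigabyte download arrived intact. The following is a minimal sketch using Python's standard hashlib module; the function names md5_of_file and matches_listed_md5 are hypothetical, and the local file path is whatever the archive was saved as.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives are never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path: str, listed_md5: str) -> bool:
    """Compare a file's MD5 with the value listed in the record."""
    return md5_of_file(path) == listed_md5.strip().lower()
```

For example, matches_listed_md5("ga.zip", "02f92d01d9e8fe839ebdc3a11d923ada") should return True for an undamaged copy of ga.zip saved under that name.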
