Corpus for training and evaluating diacritics restoration systems
Please use the following text to cite this item:
Náplava, Jakub; Straka, Milan; Hajič, Jan and Straňák, Pavel, 2018,
Corpus for training and evaluating diacritics restoration systems, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11234/1-2607.
Authors
Náplava, Jakub; Straka, Milan; Hajič, Jan; Straňák, Pavel
Item identifier
http://hdl.handle.net/11234/1-2607
Date issued
2018-01-31
Size
48 entries
Description
Corpus of texts in 12 languages. For each language, we provide one training, one development, and one test set acquired from Wikipedia articles. In addition, each language dataset contains a substantially larger training set collected from general Web texts. All sets are disjoint, except that the Wikipedia and Web training sets may contain similar sentences. The data are segmented into sentences, which are further tokenized into words.
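A minimal sketch of reading one of the xz-compressed text files described above. It assumes (this is not stated on this page) that each file is UTF-8 encoded, holds one sentence per line, and separates tokens with single spaces; the function name read_sentences is hypothetical.

```python
import lzma


def read_sentences(path):
    """Yield each non-empty line of an .xz text file as a list of tokens.

    Assumptions (not confirmed by the corpus description): UTF-8 text,
    one sentence per line, tokens separated by single spaces.
    """
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                yield line.split(" ")
```

Because the function is a generator, the large Web training files can be processed line by line without loading them into memory, for example with `for tokens in read_sentences("target_dev.txt.xz"): ...`.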
All data in the corpus contain diacritics. To strip diacritics from them, use the Python script diacritization_stripping.py contained in the attached stripping_diacritics.zip. The script has two modes. We generally recommend the method called uninames, which behaves better for some languages.
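The attached script is not reproduced here. The following is an independent sketch of two common ways to strip diacritics with Python's standard unicodedata module, under the assumption that a name-based mode works roughly like strip_by_name below. All function names are hypothetical and the behavior of diacritization_stripping.py may differ.

```python
import unicodedata


def strip_by_name(ch):
    """Map a character to its base letter using its Unicode name.

    "LATIN SMALL LETTER L WITH STROKE" becomes "LATIN SMALL LETTER L".
    Characters without " WITH " in their name are returned unchanged.
    """
    try:
        name = unicodedata.name(ch)
    except ValueError:  # character has no Unicode name
        return ch
    base, sep, _ = name.partition(" WITH ")
    if not sep:
        return ch
    try:
        return unicodedata.lookup(base)
    except KeyError:  # no character carries the shortened name
        return ch


def strip_diacritics(text):
    """Name-based stripping; assumes an approach similar to 'uninames'."""
    composed = unicodedata.normalize("NFC", text)
    return "".join(strip_by_name(ch) for ch in composed)


def strip_by_decomposition(text):
    """Alternative: decompose (NFD) and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

One plausible reason a name-based method behaves better for some languages is that letters such as Polish "ł" or Croatian and Vietnamese "đ" have no canonical decomposition: strip_by_decomposition("Łódź") leaves the "Ł" in place, while strip_diacritics("Łódź") returns "Lodz".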
The code for training a recurrent-neural-network-based model for diacritics restoration is located at https://github.com/arahusky/diacritics_restoration.
Acknowledgement
Ministerstvo školství, mládeže a tělovýchovy České republiky
Project code: LM2015071
Project name: LINDAT/CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat
This item is Publicly Available
and licensed under:
Files in this item
- Name: lv.zip
- Size: 148.56 MB
- Format: application/zip
- Description: Zip
- MD5: 48880586dc8e16285d6f2c7cf121fe5b
- Contents of lv:
  - target_train.txt.xz (8 MB)
  - target_dev.txt.xz (498 kB)
  - statmt_2017_17_train_target_sentences.txt.xz (138 MB)
  - target_test.txt.xz (952 kB)
- Name: vi.zip
- Size: 1.05 GB
- Format: application/zip
- Description: Zip
- MD5: 987b9e324056d52bd27ce3bfb3a0ebe2
- Contents of vi:
  - target_train.txt.xz (23 MB)
  - target_dev.txt.xz (587 kB)
  - statmt_2017_17_train_target_sentences.txt.xz (1 GB)
  - target_test.txt.xz (1 MB)
- Name: sk.zip
- Size: 520.57 MB
- Format: application/zip
- Description: Zip
- MD5: 6e925c378a93b736bd40ca9324fadb21
- Contents of sk:
  - target_train.txt.xz (17 MB)
  - target_dev.txt.xz (564 kB)
  - statmt_2017_17_train_target_sentences.txt.xz (501 MB)
  - target_test.txt.xz (1 MB)
- Name: ga.zip
- Size: 10.62 MB
- Format: application/zip
- Description: Zip
- MD5: 02f92d01d9e8fe839ebdc3a11d923ada
- Contents of ga:
  - target_train.txt.xz (1 MB)
  - target_dev.txt.xz (507 kB)
  - statmt_2017_17_train_target_sentences.txt.xz (7 MB)
  - target_test.txt.xz (966 kB)
- Name: hu.zip
- Size: 1.81 GB
- Format: application/zip
- Description: Zip
- MD5: baa30350736a10639ea6b584ba1632ea
- Contents of hu:
  - target_train.txt.xz (36 MB)
  - target_dev.txt.xz (570 kB)
  - statmt_2017_17_train_target_sentences.txt.xz (1 GB)
  - target_test.txt.xz (1 MB)
- Name: cs.zip
- Size: 2.12 GB
- Format: application/zip
- Description: Zip
- MD5: 6b03e4cbceaf597bbbb699c7043bd556
- Contents of cs:
  - target_train.txt.xz (28 MB)
  - target_dev.txt.xz (537 kB)
  - statmt_2017_17_train_target_sentences.txt.xz (2 GB)
  - target_test.txt.xz (1 MB)
- Name: es.zip
- Size: 3.18 GB
- Format: application/zip
- Description: Zip
- MD5: f4414bbe3f9128553ebb589e3148f9ae
- Contents of es:
  - target_train.txt.xz (61 MB)
  - target_dev.txt.xz (645 kB)
  - statmt_2017_17_train_target_sentences.txt.xz (3 GB)
  - target_test.txt.xz (1 MB)
- Name: fr.zip
- Size: 3.17 GB
- Format: application/zip
- Description: Zip
- MD5: ee53f56e4b9907d6f7285460147dce58
- Contents of fr:
  - target_train.txt.xz (59 MB)
  - target_dev.txt.xz (626 kB)
  - statmt_2017_17_train_target_sentences.txt.xz (3 GB)
  - target_test.txt.xz (1 MB)
- Name: tr.zip
- Size: 2.88 GB
- Format: application/zip
- Description: Zip
- MD5: 9a6169d73187f8112f8eef75c6e40dc4
- Contents of tr:
  - target_train.txt.xz (21 MB)
  - target_dev.txt.xz (515 kB)
  - statmt_2017_17_train_target_sentences.txt.xz (2 GB)
  - target_test.txt.xz (909 kB)
- Name: hr.zip
- Size: 327.78 MB
- Format: application/zip
- Description: Zip
- MD5: 2ab2f0a4ffe91ec9afd007817c3f003b
- Contents of hr:
  - target_train.txt.xz (24 MB)
  - target_dev.txt.xz (569 kB)
  - statmt_2017_17_train_target_sentences.txt.xz (302 MB)
  - target_test.txt.xz (1 MB)
- Name: ro.zip
- Size: 718.05 MB
- Format: application/zip
- Description: Zip
- MD5: a1d886a46f25c3b59404c6d15fba862d
- Contents of ro:
  - target_train.txt.xz (26 MB)
  - target_dev.txt.xz (623 kB)
  - statmt_2017_17_train_target_sentences.txt.xz (689 MB)
  - target_test.txt.xz (1 MB)
- Name: pl.zip
- Size: 1.47 GB
- Format: application/zip
- Description: Zip
- MD5: 3af730ee2899c7bcb54fc6a50e2c0d1e
- Contents of pl:
  - target_train.txt.xz (32 MB)
  - target_dev.txt.xz (584 kB)
  - statmt_2017_17_train_target_sentences.txt.xz (1 GB)
  - target_test.txt.xz (1 MB)
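The MD5 values listed for each archive can be used to check that a download completed without corruption. A minimal sketch using Python's standard hashlib module follows; the function names and the local path in the usage note are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks.

    Chunked reading keeps memory use flat even for multi-gigabyte archives.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, expected_hex):
    """Compare a file's MD5 digest with a checksum copied from the listing."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

For example, assuming ga.zip was saved to the current directory, `matches_listed_md5("ga.zip", "02f92d01d9e8fe839ebdc3a11d923ada")` should return True for an intact download.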

