dc.contributor.author | Náplava, Jakub |
dc.contributor.author | Straka, Milan |
dc.contributor.author | Hajič, Jan |
dc.contributor.author | Straňák, Pavel |
dc.date.accessioned | 2018-03-05T14:37:18Z |
dc.date.available | 2018-03-05T14:37:18Z |
dc.date.issued | 2018-01-31 |
dc.identifier.uri | http://hdl.handle.net/11234/1-2607 |
dc.description | Corpus of texts in 12 languages. For each language, we provide one training, one development, and one test set acquired from Wikipedia articles. In addition, each language dataset contains a substantially larger training set collected from general Web texts. All sets are disjoint, except that the Wikipedia and Web training sets may contain similar sentences. Data are segmented into sentences, which are further tokenized into words. All data in the corpus contain diacritics. To strip diacritics, use the Python script diacritization_stripping.py contained in the attached stripping_diacritics.zip. The script has two modes; we generally recommend the one called uninames, which behaves better for some languages. The code for training a recurrent-neural-network-based model for diacritics restoration is located at https://github.com/arahusky/diacritics_restoration. |
dc.language.iso | ces |
dc.language.iso | vie |
dc.language.iso | ron |
dc.language.iso | pol |
dc.language.iso | slk |
dc.language.iso | spa |
dc.language.iso | hrv |
dc.language.iso | gle |
dc.language.iso | lav |
dc.language.iso | hun |
dc.language.iso | fra |
dc.language.iso | tur |
dc.publisher | Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) |
dc.rights | Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-sa/4.0/ |
dc.subject | diacritical marks generation |
dc.subject | natural language correction |
dc.title | Corpus for training and evaluating diacritics restoration systems |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
dc.rights.label | PUB |
has.files | yes |
branding | LINDAT / CLARIAH-CZ |
contact.person | Jakub Náplava; naplava@ufal.mff.cuni.cz; Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL) |
sponsor | Ministry of Education, Youth and Sports of the Czech Republic; LM2015071; LINDAT/CLARIN: Institute for Analysis, Processing and Distribution of Linguistic Data; nationalFunds |
size.info | 48 entries |
files.size | 18647731256 |
files.count | 13 |
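The description above recommends a mode called uninames for stripping diacritics. The following is a minimal sketch of what a name-based approach could look like, using only Python's standard `unicodedata` module. It is not the attached diacritization_stripping.py script; the function names `strip_by_name`, `strip_diacritics`, and `strip_by_decomposition` are hypothetical, and the decomposition-based variant is shown only as an assumed point of comparison.

```python
import unicodedata


def strip_by_name(ch: str) -> str:
    """Map one character to its base letter via its Unicode name.

    Hypothetical helper: "LATIN SMALL LETTER L WITH STROKE" becomes
    "LATIN SMALL LETTER L". Characters without " WITH " in their name,
    or whose shortened name does not exist, are returned unchanged.
    """
    name = unicodedata.name(ch, "")  # "" for unnamed characters such as "\n"
    if " WITH " not in name:
        return ch
    base_name = name.split(" WITH ", 1)[0]
    try:
        return unicodedata.lookup(base_name)
    except KeyError:
        return ch


def strip_diacritics(text: str) -> str:
    """Name-based stripping applied to every character of the text."""
    return "".join(strip_by_name(ch) for ch in text)


def strip_by_decomposition(text: str) -> str:
    """Assumed alternative: NFD-decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_diacritics("łódź"))        # lodz
    print(strip_by_decomposition("łódź"))  # łodz
```

The two printed lines show one reason a name-based mode can behave better for some languages: letters such as Polish ł have no canonical Unicode decomposition, so a decomposition-based approach leaves them untouched.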
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
- Name
- cs.zip
- Size
- 2.12 GB
- Format
- application/zip
- Description
- Unknown
- MD5
- 6b03e4cbceaf597bbbb699c7043bd556
- cs
- target_train.txt.xz (28 MB)
- target_dev.txt.xz (537 kB)
- statmt_2017_17_train_target_sentences.txt.xz (2 GB)
- target_test.txt.xz (1 MB)
- Name
- es.zip
- Size
- 3.18 GB
- Format
- application/zip
- Description
- Unknown
- MD5
- f4414bbe3f9128553ebb589e3148f9ae
- es
- target_train.txt.xz (61 MB)
- target_dev.txt.xz (645 kB)
- statmt_2017_17_train_target_sentences.txt.xz (3 GB)
- target_test.txt.xz (1 MB)
- Name
- fr.zip
- Size
- 3.17 GB
- Format
- application/zip
- Description
- Unknown
- MD5
- ee53f56e4b9907d6f7285460147dce58
- fr
- target_train.txt.xz (59 MB)
- target_dev.txt.xz (626 kB)
- statmt_2017_17_train_target_sentences.txt.xz (3 GB)
- target_test.txt.xz (1 MB)
- Name
- ga.zip
- Size
- 10.62 MB
- Format
- application/zip
- Description
- Unknown
- MD5
- 02f92d01d9e8fe839ebdc3a11d923ada
- ga
- target_train.txt.xz (1 MB)
- target_dev.txt.xz (507 kB)
- statmt_2017_17_train_target_sentences.txt.xz (7 MB)
- target_test.txt.xz (966 kB)
- Name
- hr.zip
- Size
- 327.78 MB
- Format
- application/zip
- Description
- Unknown
- MD5
- 2ab2f0a4ffe91ec9afd007817c3f003b
- hr
- target_train.txt.xz (24 MB)
- target_dev.txt.xz (569 kB)
- statmt_2017_17_train_target_sentences.txt.xz (302 MB)
- target_test.txt.xz (1 MB)
- Name
- hu.zip
- Size
- 1.81 GB
- Format
- application/zip
- Description
- Unknown
- MD5
- baa30350736a10639ea6b584ba1632ea
- hu
- target_train.txt.xz (36 MB)
- target_dev.txt.xz (570 kB)
- statmt_2017_17_train_target_sentences.txt.xz (1 GB)
- target_test.txt.xz (1 MB)
- Name
- lv.zip
- Size
- 148.56 MB
- Format
- application/zip
- Description
- Unknown
- MD5
- 48880586dc8e16285d6f2c7cf121fe5b
- lv
- target_train.txt.xz (8 MB)
- target_dev.txt.xz (498 kB)
- statmt_2017_17_train_target_sentences.txt.xz (138 MB)
- target_test.txt.xz (952 kB)
- Name
- pl.zip
- Size
- 1.47 GB
- Format
- application/zip
- Description
- Unknown
- MD5
- 3af730ee2899c7bcb54fc6a50e2c0d1e
- pl
- target_train.txt.xz (32 MB)
- target_dev.txt.xz (584 kB)
- statmt_2017_17_train_target_sentences.txt.xz (1 GB)
- target_test.txt.xz (1 MB)
- Name
- ro.zip
- Size
- 718.05 MB
- Format
- application/zip
- Description
- Unknown
- MD5
- a1d886a46f25c3b59404c6d15fba862d
- ro
- target_train.txt.xz (26 MB)
- target_dev.txt.xz (623 kB)
- statmt_2017_17_train_target_sentences.txt.xz (689 MB)
- target_test.txt.xz (1 MB)
- Name
- sk.zip
- Size
- 520.57 MB
- Format
- application/zip
- Description
- Unknown
- MD5
- 6e925c378a93b736bd40ca9324fadb21
- sk
- target_train.txt.xz (17 MB)
- target_dev.txt.xz (564 kB)
- statmt_2017_17_train_target_sentences.txt.xz (501 MB)
- target_test.txt.xz (1 MB)
- Name
- tr.zip
- Size
- 2.88 GB
- Format
- application/zip
- Description
- Unknown
- MD5
- 9a6169d73187f8112f8eef75c6e40dc4
- tr
- target_train.txt.xz (21 MB)
- target_dev.txt.xz (515 kB)
- statmt_2017_17_train_target_sentences.txt.xz (2 GB)
- target_test.txt.xz (909 kB)
- Name
- vi.zip
- Size
- 1.05 GB
- Format
- application/zip
- Description
- Unknown
- MD5
- 987b9e324056d52bd27ce3bfb3a0ebe2
- vi
- target_train.txt.xz (23 MB)
- target_dev.txt.xz (587 kB)
- statmt_2017_17_train_target_sentences.txt.xz (1 GB)
- target_test.txt.xz (1 MB)
- Name
- stripping_diacritics.zip
- Size
- 15.91 KB
- Format
- application/zip
- Description
- script for stripping diacritics
- MD5
- 0a3f98a7a17534acaae8aaac461cd9fa
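Each archive above is listed with an MD5 checksum, and the data files inside are xz-compressed text. As a rough sketch, a downloaded archive could be checked against its listed checksum and an extracted file read sentence by sentence as shown below. The local paths and the helper names `md5_of_file` and `iter_sentences` are hypothetical. The sketch also assumes UTF-8 text with one sentence per line and tokens separated by whitespace, which the record implies but does not state explicitly.

```python
import hashlib
import lzma
import os
from typing import Iterator, List


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def iter_sentences(xz_path: str) -> Iterator[List[str]]:
    """Yield each non-empty line of an .xz text file as a list of tokens.

    Assumes UTF-8 encoding, one sentence per line, whitespace-separated tokens.
    """
    with lzma.open(xz_path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            tokens = line.split()
            if tokens:
                yield tokens


if __name__ == "__main__":
    archive = "ga.zip"  # assumed local download path
    expected = "02f92d01d9e8fe839ebdc3a11d923ada"  # MD5 listed above for ga.zip
    if os.path.exists(archive):
        status = "OK" if md5_of_file(archive) == expected else "MISMATCH"
        print(f"{archive}: {status}")

    dev_file = os.path.join("ga", "target_dev.txt.xz")  # assumed extraction path
    if os.path.exists(dev_file):
        for tokens in iter_sentences(dev_file):
            print(tokens)
            break  # show only the first sentence
```

Reading in chunks and iterating lazily matters here because the Web training files reach several gigabytes when compressed.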