Corpus for training and evaluating diacritics restoration systems

Please use the following text to cite this item:
Náplava, Jakub; Straka, Milan; Hajič, Jan and Straňák, Pavel, 2018, Corpus for training and evaluating diacritics restoration systems, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11234/1-2607.
Date issued
2018-01-31
Size
48 entries
Description
Corpus of texts in 12 languages. For each language, we provide one training, one development, and one testing set acquired from Wikipedia articles. In addition, each language dataset contains a substantially larger training set collected from general Web texts. All sets are disjoint, except that the Wikipedia and Web training sets may contain similar sentences. The data are segmented into sentences, which are further tokenized into words. All data in the corpus contain diacritics. To strip the diacritics, use the Python script diacritization_stripping.py contained in the attached stripping_diacritics.zip. The script has two modes; we generally recommend the one called uninames, which behaves better for some languages. The code for training a recurrent neural network model for diacritics restoration is available at https://github.com/arahusky/diacritics_restoration.
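To illustrate what stripping diacritics involves, here is a minimal sketch in Python. It is not the code from diacritization_stripping.py; the function names are hypothetical, and the name-based approach is only an assumption about what a mode called uninames might do. The first helper looks up each character's Unicode name and drops everything from " WITH " onward; the second uses canonical decomposition (NFD) and removes combining marks.

```python
import unicodedata


def strip_by_name(ch: str) -> str:
    """Return the base letter of `ch` using its Unicode name (hypothetical helper).

    Example: 'LATIN SMALL LETTER R WITH CARON' -> 'LATIN SMALL LETTER R'.
    """
    try:
        name = unicodedata.name(ch)
    except ValueError:
        # Character has no name (e.g. a control character); leave it alone.
        return ch
    base_name, separator, _ = name.partition(" WITH ")
    if not separator:
        return ch
    try:
        return unicodedata.lookup(base_name)
    except KeyError:
        # The shortened name is not a real character; keep the original.
        return ch


def strip_diacritics(text: str) -> str:
    """Strip diacritics character by character using Unicode names."""
    return "".join(strip_by_name(ch) for ch in text)


def strip_by_decomposition(text: str) -> str:
    """Strip diacritics by NFD decomposition and removal of combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

The two approaches agree on most letters but differ on characters that have no canonical decomposition, such as "đ" or "Ł": the name-based helper maps them to "d" and "L", while the decomposition-based one leaves them unchanged. This may be one reason a name-based mode works better for some languages, though the source does not say so.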
 Files in this item
Name
stripping_diacritics.zip
Size
15.91 KB
Format
application/zip
Description
Zip
MD5
0a3f98a7a17534acaae8aaac461cd9fa
Preview
  File Preview
    • diacritization_stripping_data.py (55 kB)
    • diacritization_stripping.py (1 kB)
Name
lv.zip
Size
148.56 MB
Format
application/zip
Description
Zip
MD5
48880586dc8e16285d6f2c7cf121fe5b
Preview
  File Preview
  • lv
    • target_train.txt.xz (8 MB)
    • target_dev.txt.xz (498 kB)
    • statmt_2017_17_train_target_sentences.txt.xz (138 MB)
    • target_test.txt.xz (952 kB)
Name
vi.zip
Size
1.05 GB
Format
application/zip
Description
Zip
MD5
987b9e324056d52bd27ce3bfb3a0ebe2
Preview
  File Preview
  • vi
    • target_train.txt.xz (23 MB)
    • target_dev.txt.xz (587 kB)
    • statmt_2017_17_train_target_sentences.txt.xz (1 GB)
    • target_test.txt.xz (1 MB)
Name
sk.zip
Size
520.57 MB
Format
application/zip
Description
Zip
MD5
6e925c378a93b736bd40ca9324fadb21
Preview
  File Preview
  • sk
    • target_train.txt.xz (17 MB)
    • target_dev.txt.xz (564 kB)
    • statmt_2017_17_train_target_sentences.txt.xz (501 MB)
    • target_test.txt.xz (1 MB)
Name
ga.zip
Size
10.62 MB
Format
application/zip
Description
Zip
MD5
02f92d01d9e8fe839ebdc3a11d923ada
Preview
  File Preview
  • ga
    • target_train.txt.xz (1 MB)
    • target_dev.txt.xz (507 kB)
    • statmt_2017_17_train_target_sentences.txt.xz (7 MB)
    • target_test.txt.xz (966 kB)
Name
hu.zip
Size
1.81 GB
Format
application/zip
Description
Zip
MD5
baa30350736a10639ea6b584ba1632ea
Preview
  File Preview
  • hu
    • target_train.txt.xz (36 MB)
    • target_dev.txt.xz (570 kB)
    • statmt_2017_17_train_target_sentences.txt.xz (1 GB)
    • target_test.txt.xz (1 MB)
Name
cs.zip
Size
2.12 GB
Format
application/zip
Description
Zip
MD5
6b03e4cbceaf597bbbb699c7043bd556
Preview
  File Preview
  • cs
    • target_train.txt.xz (28 MB)
    • target_dev.txt.xz (537 kB)
    • statmt_2017_17_train_target_sentences.txt.xz (2 GB)
    • target_test.txt.xz (1 MB)
Name
es.zip
Size
3.18 GB
Format
application/zip
Description
Zip
MD5
f4414bbe3f9128553ebb589e3148f9ae
Preview
  File Preview
  • es
    • target_train.txt.xz (61 MB)
    • target_dev.txt.xz (645 kB)
    • statmt_2017_17_train_target_sentences.txt.xz (3 GB)
    • target_test.txt.xz (1 MB)
Name
fr.zip
Size
3.17 GB
Format
application/zip
Description
Zip
MD5
ee53f56e4b9907d6f7285460147dce58
Preview
  File Preview
  • fr
    • target_train.txt.xz (59 MB)
    • target_dev.txt.xz (626 kB)
    • statmt_2017_17_train_target_sentences.txt.xz (3 GB)
    • target_test.txt.xz (1 MB)
Name
tr.zip
Size
2.88 GB
Format
application/zip
Description
Zip
MD5
9a6169d73187f8112f8eef75c6e40dc4
Preview
  File Preview
  • tr
    • target_train.txt.xz (21 MB)
    • target_dev.txt.xz (515 kB)
    • statmt_2017_17_train_target_sentences.txt.xz (2 GB)
    • target_test.txt.xz (909 kB)
Name
hr.zip
Size
327.78 MB
Format
application/zip
Description
Zip
MD5
2ab2f0a4ffe91ec9afd007817c3f003b
Preview
  File Preview
  • hr
    • target_train.txt.xz (24 MB)
    • target_dev.txt.xz (569 kB)
    • statmt_2017_17_train_target_sentences.txt.xz (302 MB)
    • target_test.txt.xz (1 MB)
Name
ro.zip
Size
718.05 MB
Format
application/zip
Description
Zip
MD5
a1d886a46f25c3b59404c6d15fba862d
Preview
  File Preview
  • ro
    • target_train.txt.xz (26 MB)
    • target_dev.txt.xz (623 kB)
    • statmt_2017_17_train_target_sentences.txt.xz (689 MB)
    • target_test.txt.xz (1 MB)
Name
pl.zip
Size
1.47 GB
Format
application/zip
Description
Zip
MD5
3af730ee2899c7bcb54fc6a50e2c0d1e
Preview
  File Preview
  • pl
    • target_train.txt.xz (32 MB)
    • target_dev.txt.xz (584 kB)
    • statmt_2017_17_train_target_sentences.txt.xz (1 GB)
    • target_test.txt.xz (1 MB)
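
The files listed above can be checked against their MD5 sums and read without decompressing them to disk first. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the sketch assumes that each .txt.xz file stores one sentence per line with tokens separated by whitespace, which the description suggests but does not state explicitly.

```python
import hashlib
import lzma


def md5_hex(stream, chunk_size=1 << 20):
    """Return the MD5 hex digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def read_sentences(source):
    """Yield each sentence of an xz-compressed text file as a list of tokens.

    `source` may be a file path or an open binary file object.
    Assumes one sentence per line and whitespace-separated tokens.
    """
    with lzma.open(source, "rt", encoding="utf-8") as handle:
        for line in handle:
            tokens = line.split()
            if tokens:  # skip empty lines
                yield tokens
```

For example, after downloading lv.zip, opening it in binary mode and passing the file object to md5_hex should give 48880586dc8e16285d6f2c7cf121fe5b if the download is intact. Once the archive is unpacked, read_sentences can be pointed at a file such as lv/target_dev.txt.xz (the unpacked location is an assumption based on the preview listing).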