Deltacorpus 1.1

Please use the following text to cite this item:
Mareček, David; Yu, Zhiwei; Zeman, Daniel and Žabokrtský, Zdeněk, 2016, Deltacorpus 1.1, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11234/1-1743.
Date issued
2016-06-20
Size
94307862 tokens
Description
Texts in 107 languages from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9), limited to the first 1,000,000 tokens per language and tagged by the delexicalized tagger described in Yu et al. (2016, LREC, Portorož, Slovenia).
Changes in version 1.1:
1. The Universal Dependencies tagset is used instead of the older and smaller Google Universal POS tagset.
2. The SVM classifier was trained on Universal Dependencies 1.2 instead of HamleDT 2.0.
3. Balto-Slavic, Germanic and Romance languages were each tagged by a classifier trained only on the respective group of languages. Other languages were tagged by a classifier trained on all available languages. The "c7" combination from version 1.0 is no longer used.
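The group-based choice of classifier described in change 3 can be sketched as a simple lookup. This is only an illustration: the group members listed here are small example subsets chosen for this sketch, the function name and the "all" label are hypothetical, and the item page does not list the actual group membership used by the authors.

```python
# Illustrative subsets only; the full group membership used for Deltacorpus 1.1
# is not given on the item page. Codes are ISO 639-3, as in the data file names.
GROUP_MEMBERS = {
    "balto-slavic": {"ces", "pol", "rus", "lit", "lav"},
    "germanic": {"deu", "eng", "nld", "swe"},
    "romance": {"fra", "spa", "ita", "por"},
}


def classifier_group(lang_code):
    """Return the (hypothetical) label of the classifier used for a language.

    Languages in one of the three groups get the group-specific classifier;
    every other language falls back to the classifier trained on all languages.
    """
    for group, members in GROUP_MEMBERS.items():
        if lang_code in members:
            return group
    return "all"
```

For example, `classifier_group("ces")` selects the Balto-Slavic classifier, while a language outside the three groups, such as `fin`, falls back to the classifier trained on all available languages.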
Version History

Version    Date
2*         2016-06-20
1          2016-03-17
* Selected version
 Files in this item
Name
deltacorpus-1.1.tar
Size
438.95 MB
Format
application/x-tar
Description
tar Archive
MD5
6420ab90b7edca2dfc1a7269c1c3cbf7
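A downloaded copy of the archive can be checked against the MD5 value above. The following is a minimal sketch using only the Python standard library; the function names and the chunk size are choices made for this example, and the local path passed in is whatever location the archive was saved to.

```python
import hashlib

# MD5 digest published on the item page for deltacorpus-1.1.tar.
EXPECTED_MD5 = "6420ab90b7edca2dfc1a7269c1c3cbf7"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """Compare the file's MD5 digest with the expected value (case-insensitive)."""
    return md5_of_file(path) == expected.lower()
```

Calling `archive_is_intact("deltacorpus-1.1.tar")` returns `True` only if the file matches the published digest. MD5 detects accidental corruption during download; it is not a strong guarantee against deliberate tampering.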
  File Preview
  • deltacorpus-1.1
    • LANGUAGES.txt (5 kB)
    • README.txt (953 B)
    • data
      • tgk.txt.gz (4 MB)
      • mal.txt.gz (5 MB)
      • pam.txt.gz (4 MB)
      • bos.txt.gz (4 MB)
      • jav.txt.gz (4 MB)
      • bel.txt.gz (4 MB)
      • hrv.txt.gz (4 MB)
      • ben.txt.gz (5 MB)
      • slv.txt.gz (4 MB)
      • aze.txt.gz (4 MB)
      • spa.txt.gz (4 MB)
      • fra.txt.gz (4 MB)
      • ron.txt.gz (4 MB)
      • hin.txt.gz (4 MB)
      • hat.txt.gz (3 MB)
      • war.txt.gz (2 MB)
      • dan.txt.gz (4 MB)
      • hbs.txt.gz (4 MB)
      • kur.txt.gz (4 MB)
      • pol.txt.gz (4 MB)
      • hsb.txt.gz (201 kB)
      • epo.txt.gz (4 MB)
      • lat.txt.gz (4 MB)
      • lav.txt.gz (4 MB)
      • arz.txt.gz (4 MB)
      • tam.txt.gz (5 MB)
      • nds.txt.gz (3 MB)
      • vie.txt.gz (3 MB)
      • rus.txt.gz (4 MB)
      • sqi.txt.gz (4 MB)
      • ind.txt.gz (4 MB)
      • swe.txt.gz (4 MB)
      • nep.txt.gz (5 MB)
      • vol.txt.gz (744 kB)
      • arg.txt.gz (4 MB)
      • bpy.txt.gz (5 MB)
      • guj.txt.gz (4 MB)
      • deu.txt.gz (4 MB)
      • hye.txt.gz (4 MB)
      • hif.txt.gz (4 MB)
      • msa.txt.gz (4 MB)
      • uzb.txt.gz (4 MB)
      • wln.txt.gz (632 kB)
      • fry.txt.gz (4 MB)
      • yid.txt.gz (4 MB)
      • sah.txt.gz (5 MB)
      • kor.txt.gz (5 MB)
      • diq.txt.gz (1 MB)
      • isl.txt.gz (4 MB)
      • swa.txt.gz (4 MB)
      • eus.txt.gz (4 MB)
      • cym.txt.gz (3 MB)
      • vec.txt.gz (4 MB)
      • cat.txt.gz (4 MB)
      • amh.txt.gz (39 kB)
      • urd.txt.gz (4 MB)
      • nap.txt.gz (1 MB)
      • tat.txt.gz (5 MB)
      • kaz.txt.gz (5 MB)
      • lmo.txt.gz (3 MB)
      • gsw.txt.gz (4 MB)
      • glk.txt.gz (2 MB)
      • ara.txt.gz (4 MB)
      • new.txt.gz (304 kB)
      • mon.txt.gz (4 MB)
      • eng.txt.gz (4 MB)
      • sun.txt.gz (1 MB)
      • pms.txt.gz (1 MB)
      • sco.txt.gz (4 MB)
      • tgl.txt.gz (4 MB)
      • heb.txt.gz (4 MB)
      • bul.txt.gz (4 MB)
      • tel.txt.gz (5 MB)
      • ita.txt.gz (4 MB)
      • mri.txt.gz (4 MB)
      • fas.txt.gz (4 MB)
      • kat.txt.gz (5 MB)
      • gle.txt.gz (4 MB)
      • glg.txt.gz (4 MB)
      • chv.txt.gz (70 kB)
      • ukr.txt.gz (4 MB)
      • hun.txt.gz (4 MB)
      • fao.txt.gz (4 MB)
      • lim.txt.gz (4 MB)
      • ido.txt.gz (1 MB)
      • ast.txt.gz (4 MB)
      • afr.txt.gz (4 MB)
      • gla.txt.gz (3 MB)
      • mlg.txt.gz (3 MB)
      • ina.txt.gz (3 MB)
      • mar.txt.gz (5 MB)
      • slk.txt.gz (4 MB)
      • tur.txt.gz (4 MB)
      • ltz.txt.gz (4 MB)
      • kan.txt.gz (5 MB)
      • ell.txt.gz (4 MB)
      • ces.txt.gz (4 MB)
      • bre.txt.gz (3 MB)
      • nor.txt.gz (4 MB)
      • fin.txt.gz (4 MB)
      • por.txt.gz (4 MB)
      • srp.txt.gz (4 MB)
      • lit.txt.gz (4 MB)
      • est.txt.gz (4 MB)
      • nno.txt.gz (4 MB)
      • mkd.txt.gz (4 MB)
      • nld.txt.gz (4 MB)
    • POS_TAGSET.txt (584 B)
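
Given the layout above, a single language file can be read straight from the tar archive without unpacking everything. The sketch below uses only the Python standard library; the function name is hypothetical, and UTF-8 encoding is an assumption that the item page does not confirm. The line format inside each file is described in README.txt and is not interpreted here.

```python
import gzip
import tarfile


def iter_language_lines(tar_path, lang_code, root="deltacorpus-1.1"):
    """Yield the text lines of one language file stored inside the archive.

    The member path follows the file preview: <root>/data/<lang_code>.txt.gz.
    UTF-8 is assumed for decoding; adjust if README.txt states otherwise.
    """
    member_name = f"{root}/data/{lang_code}.txt.gz"
    with tarfile.open(tar_path, "r") as archive:
        # extractfile raises KeyError if the member does not exist.
        member = archive.extractfile(member_name)
        if member is None:
            raise ValueError(f"{member_name} is not a regular file")
        # gzip.open accepts an already opened binary file object.
        with gzip.open(member, "rt", encoding="utf-8") as text:
            for line in text:
                yield line.rstrip("\n")
```

For instance, `next(iter_language_lines("deltacorpus-1.1.tar", "ces"))` would return the first line of the Czech file, assuming the archive has been downloaded to the current directory.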