
Deltacorpus 1.1

This is a searchable version of Deltacorpus 1.1, as available from the LINDAT repository.

The corpus consists of texts in 107 languages taken from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9): the first 1,000,000 tokens of each language, tagged with the delexicalized tagger described in Yu et al. (2016, LREC, Portorož, Slovenia). The VRT files were converted to TEI/XML, and for the 48 languages for which a UDPipe model exists, the documents were additionally parsed and tagged automatically with UDPipe.

The POS tag assigned by the delexicalized tagger is labelled @pos; the one assigned by UDPipe is labelled @upos. Since both are assigned automatically, the accuracy of the delexicalized tagger cannot be measured directly. The table below gives an impression instead, listing the accuracy of the UDPipe model and the percentage of tokens on which the two POS tags agree, which would be the accuracy of the delexicalized tagger if the UDPipe output were perfect.
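The agreement figure described above can be reproduced for a single document along the following lines. This is a minimal sketch, not the procedure actually used to build the table: the attribute names `pos` and `upos` come from the description above, but the token element name `tok`, the function name `pos_overlap`, and the sample document are assumptions made for illustration. Tokens that lack either tag are skipped here, which may differ from how the table's denominators were counted.

```python
import xml.etree.ElementTree as ET


def pos_overlap(xml_text, token_tag="tok"):
    """Return (tokens_compared, tokens_matching, percent_match) for one document.

    token_tag is an assumed element name; adjust it to the actual TEI/XML files.
    """
    root = ET.fromstring(xml_text)
    compared = matching = 0
    for elem in root.iter():
        # Strip any XML namespace so "{ns}tok" and "tok" are treated alike.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name != token_tag:
            continue
        pos, upos = elem.get("pos"), elem.get("upos")
        if pos is None or upos is None:
            continue  # skip tokens lacking either tag
        compared += 1
        if pos == upos:
            matching += 1
    percent = 100.0 * matching / compared if compared else 0.0
    return compared, matching, percent


# Hypothetical four-token document: the two taggers disagree on one token.
sample = """<TEI><text><s>
  <tok pos="DET" upos="DET">The</tok>
  <tok pos="NOUN" upos="NOUN">cat</tok>
  <tok pos="NOUN" upos="VERB">sleeps</tok>
  <tok pos="PUNCT" upos="PUNCT">.</tok>
</s></text></TEI>"""

print(pos_overlap(sample))  # (4, 3, 75.0)
```

The percentage is only an upper-level impression: where UDPipe itself is wrong, a disagreement does not necessarily mean the delexicalized tagger made an error, and an agreement does not guarantee that both are right.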

 

Lang   Files  Tokens   Matching  Avg match (%)     Min match (%)     Max match (%)
bel    47     966232   595839    61.6662457877611  44.6131892773418  63.6985959126568
afr    49     985279   566056    57.4513411937126  56.1087553005737  59.9394841746314
bul    22     454589   306374    67.3958234801106  59.521737729221   69.6076111137407
heb    1      30029    14500     48.2866562323088  48.2866562323088  48.2866562323088
nld    3      60121    40817     67.8914189717403  67.3725646519508  68.2111261058629
glg    2      40059    25511     63.6835667390599  63.5824022346369  63.7849182949378

Total  124    2536309  1549097   61.0768246298065  44.6131892773418  69.6076111137407
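The figures in the table are consistent with the average match being computed as matching tokens divided by total tokens, so that the average in the last row is weighted by corpus size rather than being a plain mean of the per-language percentages. The short check below illustrates this with the token counts copied from the table; the variable names are illustrative, and the reading of the columns is an inference from the numbers, not something the corpus documentation states.

```python
# (tokens, matching) per language, copied from the table above.
rows = {
    "bel": (966232, 595839),
    "afr": (985279, 566056),
    "bul": (454589, 306374),
    "heb": (30029, 14500),
    "nld": (60121, 40817),
    "glg": (40059, 25511),
}

total_tokens = sum(tokens for tokens, _ in rows.values())
total_matching = sum(matching for _, matching in rows.values())

# Token-weighted overall agreement: this reproduces the last row of the table.
overall = 100.0 * total_matching / total_tokens

# Per-language agreement, and its unweighted mean for comparison.
per_language = {
    lang: 100.0 * matching / tokens for lang, (tokens, matching) in rows.items()
}
unweighted_mean = sum(per_language.values()) / len(per_language)

print(total_tokens, total_matching)
print(f"token-weighted average: {overall:.4f}")
print(f"unweighted mean of languages: {unweighted_mean:.4f}")
```

Because the six languages differ widely in size (Hebrew contributes about 30,000 tokens, Afrikaans nearly a million), the weighted figure is dominated by the larger corpora.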