Deltacorpus 1.1
This is a searchable version of Deltacorpus 1.1, as available from the LINDAT repository.
The corpus consists of texts in 107 languages taken from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9), limited to the first 1,000,000 tokens of each language. The texts were tagged with the delexicalized tagger described in Yu et al. (2016, LREC, Portorož, Slovenia). The VRT files were converted to TEI/XML, and for the 48 languages for which a UDPipe model exists, the documents were additionally parsed and tagged automatically.
The POS tag provided by the delexicalized tagger is labelled @pos; the one provided by UDPipe is labelled @upos. Since both are assigned automatically, there is no real way to specify the accuracy of the delexicalized tagger. The table below nevertheless gives an impression: it lists the accuracy of the UDPipe model and the percentage of overlap between the two POS tags, which would be the accuracy of the delexicalized tagger if the UDPipe output were perfect.
Lang | files | tokens | matching | avg match (%) | min match (%) | max match (%) |
---|---|---|---|---|---|---|
bel | 47 | 966232 | 595839 | 61.6662457877611 | 44.6131892773418 | 63.6985959126568 |
afr | 49 | 985279 | 566056 | 57.4513411937126 | 56.1087553005737 | 59.9394841746314 |
bul | 22 | 454589 | 306374 | 67.3958234801106 | 59.521737729221 | 69.6076111137407 |
heb | 1 | 30029 | 14500 | 48.2866562323088 | 48.2866562323088 | 48.2866562323088 |
nld | 3 | 60121 | 40817 | 67.8914189717403 | 67.3725646519508 | 68.2111261058629 |
glg | 2 | 40059 | 25511 | 63.6835667390599 | 63.5824022346369 | 63.7849182949378 |
tot | 124 | 2536309 | 1549097 | 61.0768246298065 | 44.6131892773418 | 69.6076111137407 |
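The overlap percentage can be illustrated with a minimal sketch. It assumes that the two tags are compared by plain string equality and that each token is an XML element carrying both @pos and @upos. The `<tok>` element name and the sample sentence are hypothetical and serve only as illustration; the function names `count_overlap` and `match_percentage` are likewise made up for this example and are not part of the corpus tooling.

```python
import xml.etree.ElementTree as ET


def count_overlap(xml_text):
    """Return (compared, matching) over all elements that have both @pos and @upos."""
    root = ET.fromstring(xml_text)
    compared = 0
    matching = 0
    for el in root.iter():
        pos = el.get("pos")    # tag from the delexicalized tagger
        upos = el.get("upos")  # tag from UDPipe
        if pos is None or upos is None:
            continue  # skip elements that are not tagged by both tools
        compared += 1
        if pos == upos:  # assumption: overlap means identical tag strings
            matching += 1
    return compared, matching


def match_percentage(matching, tokens):
    """Share of tokens on which both taggers agree, as a percentage."""
    return 100.0 * matching / tokens if tokens else 0.0


# Hypothetical sample; the element name <tok> is an assumption.
SAMPLE = """<TEI><text><s>
<tok pos="DET" upos="DET">the</tok>
<tok pos="NOUN" upos="NOUN">cat</tok>
<tok pos="NOUN" upos="VERB">sleeps</tok>
<tok pos="PUNCT" upos="PUNCT">.</tok>
</s></text></TEI>"""

compared, matching = count_overlap(SAMPLE)
print(compared, matching, match_percentage(matching, compared))  # 4 3 75.0
```

Applied to the counts in the table, `match_percentage(595839, 966232)` reproduces the avg match value of the bel row, which suggests that avg match is simply matching divided by tokens.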