Deltacorpus 1.1

This is a searchable version of Deltacorpus 1.1, as available from the LINDAT repository.

The corpus consists of texts in 107 languages taken from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9): the first 1,000,000 tokens for each language, tagged by the delexicalized tagger described in Yu et al. (2016, LREC, Portorož, Slovenia). The VRT files were converted to TEI/XML and, for the 48 languages for which a UDPipe model exists, the documents were additionally parsed and tagged automatically.

The POS tag provided by the delexicalized tagger is labelled @pos; the one provided by UDPipe is labelled @upos. Since both are assigned automatically, there is no direct way to measure the accuracy of the delexicalized tagger. The table below gives an impression instead, listing the accuracy of the UDPipe model and the percentage of tokens on which the two POS tags agree. This overlap is what the accuracy of the delexicalized tagger would be if the UDPipe output were perfect.
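The overlap described above can be sketched in a few lines of Python. This is only an illustration, not the script used to build the table: it assumes that @pos and @upos sit on the same XML element and use the same tag inventory, so that plain string equality is meaningful. The element names in the sample (text, s, tok) and the function name pos_overlap are hypothetical.

```python
import xml.etree.ElementTree as ET


def pos_overlap(xml_text):
    """Return (tokens, matching, percent) over elements carrying both tags."""
    root = ET.fromstring(xml_text)
    tokens = matching = 0
    for elem in root.iter():
        pos, upos = elem.get("pos"), elem.get("upos")
        if pos is None or upos is None:
            continue  # skip elements that lack either annotation
        tokens += 1
        if pos == upos:
            matching += 1
    percent = 100.0 * matching / tokens if tokens else 0.0
    return tokens, matching, percent


# Hypothetical three-token sample; element names are illustrative only.
sample = """<text><s>
  <tok pos="NOUN" upos="NOUN">a</tok>
  <tok pos="VERB" upos="AUX">b</tok>
  <tok pos="ADJ" upos="ADJ">c</tok>
</s></text>"""

print(pos_overlap(sample))
```

Elements without both attributes are skipped, so documents in languages that have no UDPipe model would simply yield zero tokens here.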

Lang	files	tokens	matching	avg match (%)	min match (%)	max match (%)
bel	47	966232	595839	61.6662457877611	44.6131892773418	63.6985959126568
afr	49	985279	566056	57.4513411937126	56.1087553005737	59.9394841746314
bul	22	454589	306374	67.3958234801106	59.521737729221	69.6076111137407
heb	1	30029	14500	48.2866562323088	48.2866562323088	48.2866562323088
nld	3	60121	40817	67.8914189717403	67.3725646519508	68.2111261058629
glg	2	40059	25511	63.6835667390599	63.5824022346369	63.7849182949378

tot	124	2536309	1549097	61.0768246298065	44.6131892773418	69.6076111137407
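
The "avg match" figures in the table are consistent with matching divided by tokens, expressed as a percentage. The short Python sketch below recomputes them from the counts listed above. The variable and function names are chosen for illustration; the counts are copied from the table.

```python
# (files, tokens, matching) per language, copied from the table above.
rows = {
    "bel": (47, 966232, 595839),
    "afr": (49, 985279, 566056),
    "bul": (22, 454589, 306374),
    "heb": (1, 30029, 14500),
    "nld": (3, 60121, 40817),
    "glg": (2, 40059, 25511),
}


def match_percent(tokens, matching):
    """Percentage of tokens on which the two POS tags agree."""
    return 100.0 * matching / tokens


for lang, (files, tokens, matching) in rows.items():
    print(f"{lang}\t{match_percent(tokens, matching):.2f}")

# Totals are pooled over all tokens, not averaged over the language rows.
total_files = sum(r[0] for r in rows.values())
total_tokens = sum(r[1] for r in rows.values())
total_matching = sum(r[2] for r in rows.values())
print(f"tot\t{match_percent(total_tokens, total_matching):.2f}")
```

Because the total is pooled over tokens, languages with more tokens weigh more heavily in the "tot" row than languages such as heb, which contributes a single file.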