Latest Additions

On this page we keep you informed about the latest corpora added to the TEITOK@LINDAT family.

ParlaMint 4.1
The latest version of the ParlaMint corpus (version 4.1) as a searchable TEITOK corpus. ParlaMint is a collection of comparable and uniformly annotated corpora of parliamentary debates in Europe.
UD 2.14
The latest version of the Universal Dependencies treebanks (version 2.14) as a searchable TEITOK corpus. The UD corpus contains 32M tokens in 283 manually verified UD-parsed treebanks in 161 different languages.
ParlaMint 4.0
Version 4.0 of the ParlaMint corpus as a searchable TEITOK corpus. ParlaMint is a collection of comparable and uniformly annotated corpora of parliamentary debates in Europe.
UD 2.12
Version 2.12 of the Universal Dependencies treebanks as a searchable TEITOK corpus. The UD corpus contains 29M tokens in manually verified UD-parsed treebanks in 138 different languages.
EHRI
A live corpus consisting of the texts from the various digital editions of the European Holocaust Research Infrastructure.
Multilingual Newspaper Corpus
The Multilingual Newspaper Corpus (MuNeCo) is a large corpus of newspaper articles in many different languages. It is meant primarily as a resource for less-resourced languages, but it also provides comparable data for a wide range of languages. MuNeCo is a monitor corpus that will grow over time. It currently consists of texts from 185 different newspapers in 133 languages, with a total size of 840 million tokens and with linguistic annotations where possible.
SIR 1.0
A searchable version of the SiR 1.0 corpus, built from the version in the LINDAT repository. The SiR corpus is a collection of articles published on the iRozhlas server with a manual annotation of citations.
Corpus of Czech Verse
A searchable version of the Corpus of Czech Verse, built from the downloadable Git repository as of 4 January 2023. The Corpus of Czech Verse (CCV) is a corpus of Czech poetry of the 19th century and the beginning of the 20th century, created by the Institute of Czech Literature. It is lemmatized and annotated phonetically, morphologically, metrically, and strophically.