Latest Additions

On this page we keep you informed about the latest corpora added to the TEITOK@LINDAT family.

ParlaMint 4.1: The latest version of the ParlaMint corpus (version 4.1) as a searchable TEITOK corpus. ParlaMint is a collection of comparable and uniformly annotated corpora of parliamentary debates in Europe.
UD 2.14: The latest version of the Universal Dependencies treebanks (version 2.14) as a searchable TEITOK corpus. The UD corpus contains 32M tokens in 283 manually verified UD parsed treebanks in 161 different languages.
ParlaMint 4.0: The latest version of the ParlaMint corpus (version 4.0) as a searchable TEITOK corpus. ParlaMint is a collection of comparable and uniformly annotated corpora of parliamentary debates in Europe.
UD 2.12: The latest version of the Universal Dependencies treebanks (version 2.12) as a searchable TEITOK corpus. The UD corpus contains 29M tokens in manually verified UD parsed treebanks in 138 different languages.
EHRI: A live corpus consisting of the texts from the various digital editions of the European Holocaust Research Infrastructure.
Multilingual Newspaper Corpus: The Multilingual Newspaper Corpus (MuNeCo) is a large corpus of newspaper articles in many different languages. It is meant primarily as a resource for less-resourced languages, but it also provides comparable data for a wide range of languages. MuNeCo is a monitor corpus that will grow over time; it currently consists of texts from 133 languages and 185 different newspapers, with a total size of 840 million tokens and linguistic annotations where possible.
SiR 1.0: We created a searchable version of the SiR 1.0 corpus, built from the release in the LINDAT repository. The SiR corpus is a collection of articles published on the iRozhlas server with a manual annotation of citations.
Corpus of Czech Verse: We created a searchable version of the Corpus of Czech Verse, built from the downloadable Git repository as of 4 January 2023. The Corpus of Czech Verse (CCV) is a lemmatized corpus of Czech poetry of the 19th century and the beginning of the 20th century, with phonetic, morphological, metrical, and strophic annotation, created by the Institute of Czech Literature.