
Latest Additions

On this page we keep you informed about the latest corpora added to the TEITOK@LINDAT family.

ParlaMint 4.0
The latest version of the ParlaMint corpus (version 4.0) as a searchable TEITOK corpus. ParlaMint is a collection of comparable and uniformly annotated corpora of parliamentary debates in Europe. 
UD 2.12
The latest version of the Universal Dependencies treebanks (version 2.12) as a searchable TEITOK corpus. The UD corpus contains 29M tokens in manually verified, UD-parsed treebanks in 138 different languages.
EHRI
A live corpus consisting of the texts from the various digital editions of the European Holocaust Research Infrastructure (EHRI).
Multilingual Newspaper Corpus
The Multilingual Newspaper Corpus (MuNeCo) is a large corpus of newspaper articles in many different languages. It is meant primarily as a resource for less-resourced languages, but it also provides comparable data for a wide range of languages. MuNeCo is a monitor corpus that will grow over time; it currently consists of texts from 133 languages and 185 different newspapers, with a total size of 840 million tokens and linguistic annotations where possible.
SIR 1.0
We created a searchable version of the SIR 1.0 corpus, built from the LINDAT repository. The SiR corpus is a collection of articles published on the iRozhlas server with a manual annotation of citations.
Corpus of Czech Verse
We created a searchable version of the Corpus of Czech Verse, built from the downloadable Git repository on the 4th of January, 2023. The Corpus of Czech Verse (CCV) is a lemmatized, phonetically, morphologically, metrically, and strophically annotated corpus of Czech poetry of the 19th century and of the beginning of the 20th century, created by the Institute of Czech Literature.