Latest Additions
On this page we keep you informed about the latest corpora added to the TEITOK@LINDAT family.
- ParlaMint 4.1
- The latest version of the ParlaMint corpus (version 4.1) as a searchable TEITOK corpus. ParlaMint is a collection of comparable and uniformly annotated corpora of parliamentary debates in Europe.
- UD 2.14
- The latest version of the Universal Dependencies treebanks (version 2.14) as a searchable TEITOK corpus. The UD corpus contains 32M tokens in 283 manually verified UD parsed treebanks in 161 different languages.
- ParlaMint 4.0
- The latest version of the ParlaMint corpus (version 4.0) as a searchable TEITOK corpus. ParlaMint is a collection of comparable and uniformly annotated corpora of parliamentary debates in Europe.
- UD 2.12
- The latest version of the Universal Dependencies treebanks (version 2.12) as a searchable TEITOK corpus. The UD corpus contains 29M tokens in manually verified UD parsed treebanks in 138 different languages.
- EHRI
- A live corpus consisting of the texts from the various digital editions of the European Holocaust Research Infrastructure.
- Multilingual Newspaper Corpus
- The Multilingual Newspaper Corpus (MuNeCo) is a large corpus of newspaper articles in many different languages. It is meant primarily as a resource for less-resourced languages, but also provides comparable data for a wide range of languages. MuNeCo is a monitor corpus that will grow over time. It currently consists of texts from 185 different newspapers in 133 languages, with a total size of 840 million tokens and with linguistic annotations where possible.
- SiR 1.0
- We created a searchable version of the SiR 1.0 corpus from the release in the LINDAT repository. The SiR corpus is a collection of articles published on the iRozhlas server, manually annotated for citations.
- Corpus of Czech Verse
- We created a searchable version of the Corpus of Czech Verse from the downloadable Git repository as it stood on 4 January 2023. The Corpus of Czech Verse (CCV) is a corpus of Czech poetry of the 19th century and the beginning of the 20th century, lemmatized and annotated phonetically, morphologically, metrically, and strophically, created by the Institute of Czech Literature.