Corpora
Below is the list of corpora in the TEITOK/KonText hybrid setup hosted at ÚFAL. For a larger list of TEITOK projects, see the TEITOK project page. A larger list of KonText corpora at ÚFAL can be found in the KonText corpus list or in the repository. For corpora that have multiple versions in TEITOK, only the most recent version is displayed; click on the version number to see all versions of the corpus. The corpora are listed by corpus type, a description of which can be found here.
Acronym | Latest | Token size | Corpus Type | Corpus Status | Corpus Content | Corpus Language(s)
---|---|---|---|---|---|---
CzechVerse | | 13M | Specialized Corpus | live | Poetry | Czech
DeltaCorpus | 1.1 | 94M | LRL Corpus | stable | | Many
EHRI | | 40k | Specialized corpus | live | Letters | German, Czech, English
HaCzech | | 18k | Facsimile Corpus | stable | Handwritten texts | Czech
MaPCorp | | | Specialized Corpus | live | Poetry | Macedonian
Makoň | 2020-11-16 | 4.2M | Spoken Corpus | stable | Transcribed talks | Czech
Mazon | | 7.9k | Facsimile Corpus | live | Letters | Czech, German, English, French, Russian
Migrant Stories | | 400k | Specialized corpus | live | Migrant stories | English
MuNeCo | | 840M | LRL Corpus | live | Newspaper articles | Many
OCRCZ | | 27M | Facsimile Corpus | stable | Printed material | Czech
PDT-C | 1.0 | 3.9M | Treebank | stable | | Czech
ParCzech | 4.0 | 36M | Spoken Corpus | stable | Parliamentary sessions | Czech
ParlaMint | 4.1 | 1.4G | Specialized corpus | stable | Parliamentary sessions | Many
SIR | 1.0 | 250k | Specialized Corpus | stable | Newspaper articles | Czech
Skript 2015 | | 400k | Learner Corpus | live | | Czech
Universal Dependencies | 2.14 | 32M | Treebank | stable | | Many