Corpora

Below is the list of corpora in the TEITOK/KonText hybrid set-up hosted at ÚFAL. For a larger list of TEITOK projects, see the TEITOK project page. A larger list of KonText corpora at ÚFAL can be found in the KonText corpus list or in the repository. For corpora that have multiple versions in TEITOK, only the most recent version is displayed; click the version number to see all versions of a corpus. The corpora are listed by corpus type; the types are described on a separate page.

Corpus Language(s) = Czech

Acronym	Latest	Token size	Corpus Type	Corpus Status	Corpus Content	Corpus Language(s)
CzechVerse		13M	Specialized Corpus	live	Poetry	Czech
EHRI		40k	Specialized Corpus	live	Letters	German, Czech, English
HaCzech		18k	Facsimile Corpus	stable	Handwritten texts	Czech
Makoň	2020-11-16	4.2M	Spoken Corpus	stable	Transcribed talks	Czech
Mazon		7.9k	Facsimile Corpus	live	Letters	Czech, German, English, French, Russian
OCRCZ		27M	Facsimile Corpus	stable	Printed material	Czech
PDT-C	1.0	3.9M	Treebank	stable		Czech
ParCzech	4.0	36M	Spoken Corpus	stable	Parliamentary sessions	Czech
SIR	1.0	250k	Specialized Corpus	stable	Newspaper articles	Czech
Skript 2015		400k	Learner Corpus	live		Czech
