Word embeddings based on a large corpus of written Czech
Please use the following text to cite this item:
Jelínek, Tomáš, 2025,
Word embeddings based on a large corpus of written Czech, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11234/1-6017.
Authors
Jelínek, Tomáš
Item identifier
http://hdl.handle.net/11234/1-6017
Date issued
2025-10-24
Size
6 files
Description
This package comprises six models of Czech word embeddings in two sets, one for lemmas and one for word forms, each with vector dimensions 100, 200 and 300. The models were trained with fastText (P. Bojanowski, E. Grave, A. Joulin, T. Mikolov (2016): Enriching Word Vectors with Subword Information, https://fasttext.cc/) on the SYN v13 corpus of contemporary written Czech (Křen et al. 2024, https://wiki.korpus.cz/doku.php/en:cnk:syn:verze13), using the corpus's lemmatisation and tagging. Training used the skipgram algorithm, with -minn 2 and -maxn 5 setting the minimum and maximum character n-gram length for subwords.
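To illustrate what the -minn 2 and -maxn 5 settings mean, the following is a minimal sketch of how character n-grams can be enumerated for a single word. It assumes the scheme described in the fastText paper, where the word is wrapped in the boundary markers < and > before n-grams are extracted. The function name char_ngrams is hypothetical, and the hashing of n-grams into buckets that fastText performs internally is left out.

```python
def char_ngrams(word, minn=2, maxn=5):
    """List character n-grams of length minn..maxn for one word.

    Illustrative only: the word is wrapped in '<' and '>' boundary
    markers, as described in the fastText paper. This is not the
    fastText implementation itself.
    """
    wrapped = f"<{word}>"
    grams = []
    for start in range(len(wrapped)):
        for n in range(minn, maxn + 1):
            end = start + n
            if end > len(wrapped):
                break  # longer n-grams would run past the end of the word
            grams.append(wrapped[start:end])
    return grams


if __name__ == "__main__":
    # "pes" is Czech for "dog"; with boundary markers it becomes "<pes>"
    for gram in char_ngrams("pes"):
        print(gram)
```

With these settings a short word such as "pes" yields n-grams ranging from two-character pieces like "<p" up to the whole wrapped word "<pes>". Vectors for such subword units are what allow fastText models to produce embeddings for word forms not seen during training.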
Acknowledgement
Ministerstvo školství, mládeže a tělovýchovy
Project code: LM2023044
Project name: Český národní korpus
This item is Publicly Available
and licensed under:


