FERNET-C5 is a monolingual BERT language representation model trained from scratch on the Czech Colossal Clean Crawled Corpus (C5), a Czech counterpart of the English C4 dataset [Raffel20]. Only the Czech records were selected from the Common Crawl archives, and the resulting data were cleaned using a simple yet rigorous cleaning process which removed about 98% of the downloaded plain text, mainly due to the deduplication rules. The final cleaned dataset still contains almost 13 billion words (93 GB of text data).

The model has the same architecture as the original BERT model [Devlin19], i.e. 12 transformer blocks, 12 attention heads and a hidden size of 768. In contrast to Google's BERT models, we used SentencePiece tokenization instead of Google's internal WordPiece tokenization. We trained a BPE SentencePiece model on the underlying dataset with the vocabulary size set to 100 thousand. Since the dataset contains a small portion of non-Czech text fragments, we carefully tuned the character coverage parameter to fully cover all Czech characters and only a reasonable number of graphemes from other languages. We kept the original casing of the dataset.

Further details can be found at https://arxiv.org/abs/2107.10042

The same model is also released at https://huggingface.co/fav-kky/FERNET-C5

----

The model files have the standard format used by the Transformers library. Specifically, there are the following model files:

• config.json - model configuration
• pytorch_model.bin - model weights for the PyTorch back-end
• tf_model.h5 - model weights for the TensorFlow back-end

The tokenizer files:

• tokenizer_config.json - tokenizer configuration
• vocab.txt - tokenizer vocabulary
• special_tokens_map.json - list of special tokens

An example of interactive model usage with the Transformers library:

    from transformers import pipeline

    model = pipeline("fill-mask", "fav-kky/FERNET-C5")
    model("Smyslem života je [MASK].")

Note: The phrase "Smyslem života je ..." means "The meaning of life is ..." and the model answers "život" ("life") :)

----

[Raffel20] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.

[Devlin19] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, 2019. Association for Computational Linguistics.
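----

For illustration, a BPE SentencePiece tokenizer of the kind described above could be trained with a script similar to the following sketch. This is not the original training script: the input file name and the character coverage value are placeholders, only the 100k vocabulary size and the BPE model type follow the description above.

    import sentencepiece as spm

    # Train a BPE SentencePiece model on the cleaned corpus (hypothetical file name).
    # character_coverage is an illustrative value; the actual value was tuned to
    # fully cover all Czech characters.
    spm.SentencePieceTrainer.train(
        input="c5_cleaned.txt",
        model_prefix="fernet_c5_bpe",
        model_type="bpe",
        vocab_size=100000,
        character_coverage=0.9995,
    )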
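The fill-mask example above can also be reproduced without the pipeline helper by loading the tokenizer and the PyTorch weights directly. The following is a minimal sketch, assuming the transformers and torch packages are installed:

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("fav-kky/FERNET-C5")
    model = AutoModelForMaskedLM.from_pretrained("fav-kky/FERNET-C5")

    # "Smyslem života je [MASK]." = "The meaning of life is [MASK]."
    inputs = tokenizer("Smyslem života je [MASK].", return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits

    # Find the masked position and print the five most probable tokens for it.
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    top_ids = logits[0, mask_pos].topk(5).indices[0]
    print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))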