FERNET-C5 is a monolingual BERT language representation model trained from scratch on the Czech Colossal Clean Crawled Corpus (C5), a Czech counterpart of the English C4 dataset [Raffel20]. Only the Czech records were selected from the Common Crawl archives, and the resulting data were cleaned using a simple yet rigorous cleaning process which removed about 98% of the downloaded plain text, mainly due to the deduplication rules. The final cleaned dataset still contains almost 13 billion words (93 GB of text data).

The model has the same architecture as the original BERT model [Devlin19], i.e. 12 transformer blocks, 12 attention heads and a hidden size of 768. In contrast to Google's BERT models, we used SentencePiece tokenization instead of Google's internal WordPiece tokenization. We trained a BPE SentencePiece model on the underlying dataset with the vocabulary size set to 100 thousand. Since the dataset contains a small portion of non-Czech text fragments, we carefully tuned the character coverage parameter to fully cover all Czech characters and only a reasonable number of graphemes from other languages. We kept the original casing of the dataset.

Further details can be found at https://arxiv.org/abs/2107.10042

The same model is also released at https://huggingface.co/fav-kky/FERNET-C5

----

The model files have the standard format used by the Transformers library. Specifically, there are the following model files:

• config.json - model configuration
• pytorch_model.bin - model weights for the PyTorch back-end
• tf_model.h5 - model weights for the TensorFlow back-end

The tokenizer files:

• tokenizer_config.json - tokenizer configuration
• vocab.txt - tokenizer vocabulary
• special_tokens_map.json - list of special tokens

An example of interactive model usage with the Transformers library:

    from transformers import pipeline

    model = pipeline("fill-mask", "fav-kky/FERNET-C5")
    model("Smyslem života je [MASK].")

Note: The phrase "Smyslem života je ..." means "The meaning of life is ..." and the model answers "život" ("life") :)

----

[Raffel20] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.

[Devlin19] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, 2019. Association for Computational Linguistics.
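----

For illustration, a BPE SentencePiece tokenizer of the kind described above could be trained with a script similar to the following sketch. This is not the original training script: the input file name and the character coverage value are placeholders, only the 100k vocabulary size and the BPE model type follow the description above.

    import sentencepiece as spm

    # Train a BPE SentencePiece model on the cleaned corpus (hypothetical file name).
    # character_coverage is an illustrative value; the actual value was tuned to
    # fully cover all Czech characters.
    spm.SentencePieceTrainer.train(
        input="c5_cleaned.txt",
        model_prefix="fernet_c5_bpe",
        model_type="bpe",
        vocab_size=100000,
        character_coverage=0.9995,
    )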
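The fill-mask example above can also be reproduced without the pipeline helper by loading the tokenizer and the PyTorch weights directly. The following is a minimal sketch, assuming the transformers and torch packages are installed:

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("fav-kky/FERNET-C5")
    model = AutoModelForMaskedLM.from_pretrained("fav-kky/FERNET-C5")

    # "Smyslem života je [MASK]." = "The meaning of life is [MASK]."
    inputs = tokenizer("Smyslem života je [MASK].", return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits

    # Find the masked position and print the five most probable tokens for it.
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    top_ids = logits[0, mask_pos].topk(5).indices[0]
    print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))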