Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
- Name: README.txt
- Size: 2.81 KB
- Format: Text file
- Description: README
- MD5: a80bf9f333897324aac916d96ea81029
FERNET-C5 is a monolingual BERT language representation model trained from scratch on the Czech Colossal Clean Crawled Corpus (C5), a Czech counterpart of the English C4 dataset [Raffel20]. Only the Czech records are selected from the Common Crawl archives, and the resulting data are cleaned using a simple yet rigorous process that removed about 98% of the downloaded plain texts, mainly due to the deduplication rules. The final cleaned dataset still contains almost 13 billion words (93 GB of text data). The model has the same architecture as the original BERT model [Devlin19], i.e. 12 Transformer blocks, 12 attention heads and a hidden size of 768 neurons. In contrast to Google’s BERT models, we used SentencePiece tokenization instead of Google’s internal WordPiece tokenization. We trained a BPE SentencePiece model on the underlying dataset with the vocabulary size set to 100 thousand. Since the dataset contains a small portion of non-Czech text . . .
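The architecture figures in the README excerpt above are enough for a rough parameter count. The sketch below is only an estimate under stated assumptions: the vocabulary is taken to be exactly 100,000 entries, and the feed-forward size (3072), maximum sequence length (512) and two segment types are the usual BERT-base defaults, none of which are confirmed by the excerpt. The number of attention heads does not change the count, because the heads split the same 768-wide projections.

```python
def bert_parameter_count(vocab_size, hidden=768, layers=12,
                         intermediate=3072, max_positions=512, type_vocab=2):
    """Count weights in a BERT encoder with pooler (no pre-training heads)."""
    layer_norm = 2 * hidden  # scale and bias vectors
    # Token, position and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers of the feed-forward sub-block, each with a bias.
    feed_forward = (hidden * intermediate + intermediate) \
        + (intermediate * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler


# Assumption: a vocabulary of exactly 100,000 entries ("100 thousand" in the README).
fernet_params = bert_parameter_count(vocab_size=100_000)
print(f"{fernet_params:,} parameters")
print(f"about {fernet_params * 4 / 2**20:.0f} MiB as float32")
```

Stored as 32-bit floats, this estimate comes within a few percent of the 623.92 MB listed for pytorch_model.bin; the checkpoint may additionally hold pre-training head weights, which the sketch leaves out.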
- Name: config.json
- Size: 589 bytes
- Format: Unknown
- Description: Model configuration
- MD5: acebeb861ad4889b45fc8c2fd204d9e7
- Name: pytorch_model.bin
- Size: 623.92 MB
- Format: Unknown
- Description: Model weights for the PyTorch back-end
- MD5: a0dff38dfaced0d79713b3bd64c7e17b
- Name: tf_model.h5
- Size: 916.32 MB
- Format: Unknown
- Description: Model weights for the TensorFlow back-end
- MD5: 464550d4326635eee4903f04ee038cb6
- Name: tokenizer_config.json
- Size: 239 bytes
- Format: Unknown
- Description: Tokenizer configuration
- MD5: 12406c6951763e0023a7685628dd3090
- Name: vocab.txt
- Size: 951.39 KB
- Format: Text file
- Description: Tokenizer vocabulary
- MD5: b4ed849cc8683e49ded54d2fe92c671e
[PAD] [UNK] [CLS] [SEP] [MASK] p s n v , . t d ##st z j ##ní ##ov ##ch a k m ##ro ##le o po ##li ##ra ##la ##ou ##ce na se ##ho b ##en ne ##te ##ře je ##in ##ak do ##em ##sk ##va ##ně ##ři ##de ##že ##rá ##lo u pro " ##ni ##to za ##ří to ##ti ##na ##ci č st ##al ##ku ##an ##no ##ru ##ko ##at vy ##po by ##jí h ##ka ##je pře ##ých ##re ve ##ná ##vě ##dy ##ší ##it ko že ##ky ##ého ro ##še kte ##me ##át ##il ##cí tak ##tě js ##ar ##vo pod ob od ##né ##mi ##er zá i ##lu ##sí ##lá ##ad ##ne ##ck ná ##mě při P ##ová jed de ##av ##se ##or ##dě ##mu f M ##tí ale ##ze S ##sti dva ú mě kter mo ně ##by ##ty jak pr ##di ##du vý K ##ká ##da si ##ví ##lí le bu ho ce ##nu pří ##tu ##bo ##ji ##mo ##ové ##ly ##vá před B ##če ch ##zi ti ##ta ##ve ##lé ##ri ##dní ##vní ##ým ##ova ##ské - nej ##as ##do ##ma te ##ny ##c . . .
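The vocabulary preview above can be turned into a token-to-id table with a few lines of code. This is a minimal sketch that assumes the usual BERT vocab.txt convention: one token per line, the id equal to the zero-based line index, and a leading "##" marking a piece that continues the previous piece within a word. The helper names `parse_vocab`, `load_vocab` and `is_continuation` are hypothetical, and the sample text only repeats the first entries shown in the preview.

```python
from pathlib import Path


def parse_vocab(text):
    """Map each token to its id, assuming one token per line (id = line index)."""
    return {token: index for index, token in enumerate(text.splitlines())}


def load_vocab(path):
    """Read a vocab.txt file from disk and parse it (UTF-8 is assumed)."""
    return parse_vocab(Path(path).read_text(encoding="utf-8"))


def is_continuation(token):
    """True for pieces that attach to the preceding piece, e.g. '##st'."""
    return token.startswith("##")


# First entries of the preview, written one per line as assumed above.
sample = "\n".join(
    ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
     "p", "s", "n", "v", ",", ".", "t", "d", "##st"]
)
vocab = parse_vocab(sample)
print(vocab["[MASK]"], vocab["##st"], is_continuation("##st"))
```

With the downloaded file, `load_vocab("vocab.txt")` would build the full table in the same way.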
- Name: special_tokens_map.json
- Size: 112 bytes
- Format: Unknown
- Description: List of special tokens
- MD5: 8b3fb1023167bb4ab9d70708eb05f6ec
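The MD5 values listed for each file make it possible to check a download for corruption. The following is a minimal sketch using only the Python standard library; the checksums are copied from the listing above, while the directory name `fernet-c5` and the helper names `md5_of` and `verify` are hypothetical.

```python
import hashlib
from pathlib import Path

# Checksums as published in the file listing for this item.
EXPECTED_MD5 = {
    "README.txt": "a80bf9f333897324aac916d96ea81029",
    "config.json": "acebeb861ad4889b45fc8c2fd204d9e7",
    "pytorch_model.bin": "a0dff38dfaced0d79713b3bd64c7e17b",
    "tf_model.h5": "464550d4326635eee4903f04ee038cb6",
    "tokenizer_config.json": "12406c6951763e0023a7685628dd3090",
    "vocab.txt": "b4ed849cc8683e49ded54d2fe92c671e",
    "special_tokens_map.json": "8b3fb1023167bb4ab9d70708eb05f6ec",
}


def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large weight files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory, expected=EXPECTED_MD5):
    """Return {filename: status}; status is 'ok', 'mismatch' or 'missing'."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        else:
            results[name] = "ok" if md5_of(path) == checksum else "mismatch"
    return results


if __name__ == "__main__":
    # "fernet-c5" is a placeholder for wherever the files were saved.
    for name, status in verify("fernet-c5").items():
        print(f"{status:8}  {name}")
```

MD5 is adequate here for detecting accidental corruption during transfer; it is not a safeguard against deliberate tampering.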