dc.contributor.author | Lehečka, Jan |
dc.contributor.author | Švec, Jan |
dc.date.accessioned | 2021-09-20T12:33:51Z |
dc.date.available | 2021-09-20T12:33:51Z |
dc.date.issued | 2021-09-20 |
dc.identifier.uri | http://hdl.handle.net/11234/1-3776 |
dc.description | FERNET-C5 is a monolingual BERT language representation model trained from scratch on the Czech Colossal Clean Crawled Corpus (C5), a Czech counterpart of the English C4 dataset. The training data contained almost 13 billion words (93 GB of text). The model has the same architecture as the original BERT model, i.e. 12 transformer blocks, 12 attention heads and a hidden size of 768. In contrast to Google’s BERT models, we used SentencePiece tokenization instead of Google’s internal WordPiece tokenization. More details can be found in README.txt. A more detailed description is available at https://arxiv.org/abs/2107.10042. The same models are also released at https://huggingface.co/fav-kky/FERNET-C5 |
dc.language.iso | ces |
dc.publisher | University of West Bohemia, Department of Cybernetics |
dc.rights | Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-sa/4.0/ |
dc.subject | Czech |
dc.subject | BERT |
dc.title | FERNET-C5 |
dc.type | languageDescription |
metashare.ResourceInfo#ContentInfo.mediaType | text |
metashare.ResourceInfo#ContentInfo.detailedType | mlmodel |
dc.rights.label | PUB |
has.files | yes |
branding | LINDAT / CLARIAH-CZ |
contact.person | Pavel Ircing ircing@kky.zcu.cz University of West Bohemia |
files.size | 1616041874 |
files.count | 7 |
Files in this item
License category: Publicly Available
License: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
- Name: README.txt
- Size: 2.81 KB
- Format: Text file
- Description: README
- MD5: a80bf9f333897324aac916d96ea81029
FERNET-C5 is a monolingual BERT language representation model trained from scratch on the Czech Colossal Clean Crawled Corpus (C5), a Czech counterpart of the English C4 dataset [Raffel20]. Only Czech records were selected from the Common Crawl archives, and the resulting data were cleaned using a simple yet rigorous process that removed about 98% of the downloaded plain text, mainly through the deduplication rules. The final cleaned dataset still contains almost 13 billion words (93 GB of text). The model has the same architecture as the original BERT model [Devlin19], i.e. 12 transformer blocks, 12 attention heads and a hidden size of 768. In contrast to Google’s BERT models, we used SentencePiece tokenization instead of Google’s internal WordPiece tokenization. We trained a BPE SentencePiece model on the underlying dataset with the vocabulary size set to 100 thousand. Since the dataset contains a small portion of non-Czech text . . .
- Name: config.json
- Size: 589 bytes
- Format: Unknown
- Description: Model configuration
- MD5: acebeb861ad4889b45fc8c2fd204d9e7
- Name: pytorch_model.bin
- Size: 623.92 MB
- Format: Unknown
- Description: Model weights for the PyTorch backend
- MD5: a0dff38dfaced0d79713b3bd64c7e17b
- Name: tf_model.h5
- Size: 916.32 MB
- Format: Unknown
- Description: Model weights for the TensorFlow backend
- MD5: 464550d4326635eee4903f04ee038cb6
- Name: tokenizer_config.json
- Size: 239 bytes
- Format: Unknown
- Description: Tokenizer configuration
- MD5: 12406c6951763e0023a7685628dd3090
- Name: vocab.txt
- Size: 951.39 KB
- Format: Text file
- Description: Tokenizer vocabulary
- MD5: b4ed849cc8683e49ded54d2fe92c671e
[PAD] [UNK] [CLS] [SEP] [MASK] p s n v , . t d ##st z j ##ní ##ov ##ch a k m ##ro ##le o po ##li ##ra ##la ##ou ##ce na se ##ho b ##en ne ##te ##ře je ##in ##ak do ##em ##sk ##va ##ně ##ři ##de ##že ##rá ##lo u pro " ##ni ##to za ##ří to ##ti ##na ##ci č st ##al ##ku ##an ##no ##ru ##ko ##at vy ##po by ##jí h ##ka ##je pře ##ých ##re ve ##ná ##vě ##dy ##ší ##it ko že ##ky ##ého ro ##še kte ##me ##át ##il ##cí tak ##tě js ##ar ##vo pod ob od ##né ##mi ##er zá i ##lu ##sí ##lá ##ad ##ne ##ck ná ##mě při P ##ová jed de ##av ##se ##or ##dě ##mu f M ##tí ale ##ze S ##sti dva ú mě kter mo ně ##by ##ty jak pr ##di ##du vý K ##ká ##da si ##ví ##lí le bu ho ce ##nu pří ##tu ##bo ##ji ##mo ##ové ##ly ##vá před B ##če ch ##zi ti ##ta ##ve ##lé ##ri ##dní ##vní ##ým ##ova ##ské - nej ##as ##do ##ma te ##ny ##c . . .
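The `##` prefix in the vocab.txt preview follows the WordPiece vocabulary file convention, in which `##` marks a piece that continues a word rather than starting one. As a minimal sketch of how such a vocabulary can be applied, the Python snippet below runs a greedy longest-match-first lookup over a handful of entries copied from the preview. The function name `wordpiece_split` and the tiny `VOCAB` set are illustrative assumptions; the tokenizer shipped with the model may normalize and split text differently.

```python
# Tiny illustrative subset of entries copied from the vocab.txt preview.
VOCAB = {"[UNK]", "po", "na", "se", "##li", "##ce", "##ra", "##la"}


def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word maps to [UNK]
        pieces.append(match)
        start = end
    return pieces


print(wordpiece_split("police", VOCAB))
print(wordpiece_split("xyz", VOCAB))
```

With the full vocabulary loaded from vocab.txt (one entry per line), the same function would usually find longer pieces than this small subset allows.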
- Name: special_tokens_map.json
- Size: 112 bytes
- Format: Unknown
- Description: List of special tokens
- MD5: 8b3fb1023167bb4ab9d70708eb05f6ec
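Each file in the listing carries an MD5 checksum, so a download can be checked before the model is loaded. A minimal sketch using Python's standard `hashlib` module follows. The helper names `md5_of` and `verify` and the local directory name `FERNET-C5` are assumptions made for illustration; the two digests are copied from the listing.

```python
import hashlib
from pathlib import Path

# Digests copied from the file listing; extend with the remaining files as needed.
EXPECTED_MD5 = {
    "README.txt": "a80bf9f333897324aac916d96ea81029",
    "config.json": "acebeb861ad4889b45fc8c2fd204d9e7",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory, expected=EXPECTED_MD5):
    """Map each file name to True (match), False (mismatch), or None (missing)."""
    results = {}
    for name, want in expected.items():
        path = Path(directory) / name
        results[name] = (md5_of(path) == want) if path.is_file() else None
    return results


if __name__ == "__main__":
    # Hypothetical download directory; adjust to wherever the files were saved.
    for name, ok in verify("FERNET-C5").items():
        print(name, {True: "OK", False: "MISMATCH", None: "missing"}[ok])
```

Reading in chunks matters mainly for the two weight files, which are several hundred megabytes each.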