FERNET-C5
Please use the following text to cite this item or export to a predefined format:
Lehečka, Jan and Švec, Jan, 2021,
FERNET-C5, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11234/1-3776.
Authors: Lehečka, Jan; Švec, Jan
Item identifier: http://hdl.handle.net/11234/1-3776
Date issued: 2021-09-20
Language(s): Czech
Description
FERNET-C5 is a monolingual BERT language representation model trained from scratch on the Czech Colossal Clean Crawled Corpus (C5), a Czech counterpart of the English C4 dataset. The training data contained almost 13 billion words (93 GB of text). The model has the same architecture as the original BERT model, i.e. 12 transformer blocks, 12 attention heads and a hidden size of 768. In contrast to Google's BERT models, we used SentencePiece tokenization instead of Google's internal WordPiece tokenization.
More details can be found in README.txt. A more detailed description is available at https://arxiv.org/abs/2107.10042
The same models are also released at https://huggingface.co/fav-kky/FERNET-C5
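As a rough illustration of what the stated architecture implies, the sketch below counts the parameters of a BERT encoder with 12 blocks and a hidden size of 768. The feed-forward size (3072), the number of positions (512) and the number of segment types (2) are the usual BERT-base defaults and are assumptions here, as is a vocabulary of exactly 100,000 entries (the README only says "100 thousand"). The number of attention heads does not change the parameter count; it only has to divide the hidden size.

```python
def bert_encoder_param_count(vocab_size, hidden=768, layers=12,
                             intermediate=3072, max_positions=512,
                             type_vocab=2):
    """Count parameters of a BERT encoder with a pooler (no task head).

    All defaults except `hidden` and `layers` are assumed BERT-base values,
    not figures taken from this record.
    """
    layer_norm = 2 * hidden  # scale and bias vectors
    # Token, position and segment embeddings plus the embedding LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections (weights + biases) plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers of the feed-forward sub-block plus LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


if __name__ == "__main__":
    # 12 heads over a hidden size of 768 gives 64 dimensions per head.
    assert 768 % 12 == 0
    assumed_vocab = 100_000  # assumption: "100 thousand" taken literally
    total = bert_encoder_param_count(assumed_vocab)
    print(f"approx. encoder parameters: {total:,}")
    print(f"approx. float32 size: {total * 4 / 2**20:.1f} MiB")
```

Under these assumptions, most of the difference from the original English BERT-base model comes from the larger embedding matrix, since the per-block parameter count is identical.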
Collections
This item is Publicly Available
and licensed under:
Files in this item
- Name: pytorch_model.bin
- Size: 623.92 MB
- Format: application/octet-stream
- Description: Unknown
- MD5: a0dff38dfaced0d79713b3bd64c7e17b

- Name: tf_model.h5
- Size: 916.32 MB
- Format: application/octet-stream
- Description: Unknown
- MD5: 464550d4326635eee4903f04ee038cb6

- Name: tokenizer_config.json
- Size: 239 B
- Format: application/octet-stream
- Description: Unknown
- MD5: 12406c6951763e0023a7685628dd3090

- Name: README.txt
- Size: 2.81 KB
- Format: text/plain
- Description: Text
- MD5: a80bf9f333897324aac916d96ea81029

FERNET-C5 is a monolingual BERT language representation model trained
from scratch on the Czech Colossal Clean Crawled Corpus (C5) data,
a Czech counterpart of the English C4 dataset [Raffel20]. Only the Czech records
were selected from the Common Crawl archives, and the resulting data were cleaned using a simple
yet rigorous cleaning process which removed about 98% of the downloaded
plain texts, mainly due to the deduplication rules. The final cleaned dataset
still contains almost 13 billion words (93 GB of text data).
The model has the same architecture as the original BERT model [Devlin19],
i.e. 12 transformer blocks, 12 attention heads and a hidden size of 768.
In contrast to Google's BERT models, we used SentencePiece tokenization
instead of Google's internal WordPiece tokenization. We trained a BPE
SentencePiece model on the underlying dataset with the vocabulary size set to
100 thousand. Since the dataset contains a small portion of non-Czech text
fragments, we carefully tuned the character coverage parameter to fully cover
all Czech characters and only a reasonable number of graphemes from other languages.
We kept the original casing of the dataset.
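The tokenizer training described above could be set up roughly as follows with the `sentencepiece` Python package. This is only a sketch: the corpus path, the model prefix and in particular the `character_coverage` value are hypothetical placeholders (the README says the coverage was tuned but does not give the value here), and the vocabulary size is assumed to be exactly 100,000.

```python
def build_trainer_args(corpus_path, model_prefix, character_coverage):
    """Collect SentencePiece trainer options matching the README's description.

    `corpus_path`, `model_prefix` and `character_coverage` are supplied by the
    caller; none of their values come from the FERNET-C5 authors.
    """
    if not 0.0 < character_coverage <= 1.0:
        raise ValueError("character_coverage must be in (0, 1]")
    return {
        "input": corpus_path,            # plain-text corpus, one sentence per line
        "model_prefix": model_prefix,    # output files: <prefix>.model, <prefix>.vocab
        "model_type": "bpe",             # BPE model, as stated in the README
        "vocab_size": 100_000,           # assumption: "100 thousand" taken literally
        "character_coverage": character_coverage,
    }


def train_tokenizer(args):
    """Run the trainer; needs the third-party `sentencepiece` package."""
    import sentencepiece as spm  # imported lazily so the sketch loads without it
    spm.SentencePieceTrainer.train(**args)


if __name__ == "__main__":
    # Hypothetical file names and coverage value, for illustration only.
    args = build_trainer_args("c5_sample.txt", "c5_bpe", character_coverage=0.9999)
    print(args)
    # train_tokenizer(args)  # uncomment once a corpus file is available
```

No lowercasing option is set, which is consistent with keeping the original casing; a coverage value below 1.0 lets rare characters from other languages fall back to the unknown token.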
Further details can be found in https://arxiv.org/abs/2107.10042
The same models are also released at https://huggingface.co/fav-kky/FERNET-C5
----
The model files have the standard format used by the Transformers library.
Specifically, there are the following model files:
• config.json - model configuration
• pytorch_model.bin - model weights for the PyTorch back-end
• tf_model.h5 - model weights for the TensorFlow back-end
The tokenizer files:
• tokenizer_config.json - tokenizer configuration
• vocab.txt - tokenizer vocabulary
• special_tokens_map.json - list of special tokens
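An example of interactive model usage with the Transformers library:
from transformers import pipeline
# Load a fill-mask pipeline backed by the released FERNET-C5 checkpoint
model = pipeline("fill-mask", "fav-kky/FERNET-C5")
# Ask the model for candidate tokens at the [MASK] position
model("Smyslem života je [MASK].")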
Note: The phrase "Smyslem . . .
- Name: vocab.txt
- Size: 951.39 KB
- Format: text/plain
- Description: Text
- MD5: b4ed849cc8683e49ded54d2fe92c671e

[PAD] [UNK] [CLS] [SEP] [MASK] p s n v , . t d ##st z j ##ní ##ov ##ch a k m ##ro ##le o po ##li ##ra ##la ##ou ##ce na se ##ho b ##en ne ##te ##ře je ##in ##ak do ##em ##sk ##va ##ně ##ři ##de ##že ##rá ##lo u pro " ##ni ##to za ##ří to ##ti ##na ##ci č st ##al ##ku ##an ##no ##ru ##ko ##at vy ##po by ##jí h ##ka ##je pře ##ých ##re ve ##ná ##vě ##dy ##ší ##it ko že ##ky ##ého ro ##še kte ##me ##át ##il ##cí tak ##tě js ##ar ##vo pod ob od ##né ##mi ##er zá i ##lu ##sí ##lá ##ad ##ne ##ck ná ##mě při P ##ová jed de ##av ##se ##or ##dě ##mu f M ##tí ale ##ze S ##sti dva ú mě kter mo ně ##by ##ty jak pr ##di ##du vý K ##ká ##da si ##ví ##lí le bu ho ce ##nu pří ##tu ##bo ##ji ##mo ##ové ##ly ##vá před B ##če ch ##zi ti ##ta ##ve ##lé ##ri ##dní ##vní ##ým ##ova ##ské - nej ##as ##do ##ma te ##ny ##ct J š tisí roz os ##ět li ##nou ##rav ##des ##dí ##bu mu re ##co sou V ##čí ta ##pe ##vět ##ži ##ál spo ##ční ##dá tři ##ště ##ný mí ##stu ##sta ##ovat kdy H sv Č už : ka ##desát ##ry má ##ob še ##kla ##vi e ##rov ##má ##cet T ##mí ##ba L mi D ##tel ni R ji ? kon ##lov ##sto A čty ##uje jsem ##ním ##pad ##sle ře ##lou ví vo ##nost ##cho in co jako bude N ##pi ma které ##dou jsme ( ##ží ) ##men sed který ##stav ##ků pa podle jsou ##vr min jen čtyři také set ##ez ##ist osm ##on če ##cké ##náct ##bě O ##ran ##tá sedm sta tisíc ##den ##ský dal vě ##hod ##lav ##bí ##si ##jší ##či ##kov ##kou ##prav ž nebo ze vz poli první prá ##ád ##ste C ##vy ##kon další ##hu bo ##ím hod vel tře ##my dě tu ##oval tě F když ##ních ##za jedna byl aby ##pu ##ského ##ruh jeho me šest pět pra proto ##jící ##pa vš ten ří ##nosti E pot ##ém ##tin ##ování ##ných ##be Z ##jem bý ##ovo moh ##zí až ##pě ##ále g vše ##tní druh ##rů ##ali tr ##stup ##ter di ##dem tisíce ##zna ##rát ##tr devět cel dese sp sto která ##sku Š ##ově ##sa dne ##stě ještě ##ům nap ##ené ##su ra pů ##řej ##stí sa bylo ##kem ##ských té ##hra zp ##ská ##gi něk ##koli ##ha ##ená ##rou vá ##ový mů ##vat ##el dob ##ovi něj ##tek . . .
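The `##` prefixes in the vocabulary preview mark word-internal pieces. Assuming the file is consumed by a BERT-style tokenizer that splits each word by greedy longest-match-first lookup (an assumption suggested by the file format, not something this record states), the lookup can be sketched as below. The tiny vocabulary is a hand-picked subset of the tokens visible in the preview, and the function is a simplified illustration that skips normalization and punctuation handling.

```python
def split_word(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Subset of tokens that appear in the vocab.txt preview above.
SAMPLE_VOCAB = {"[UNK]", "po", "##le", "##li", "ne", "##bo", "nebo", "st", "##át"}

if __name__ == "__main__":
    for word in ["nebo", "pole", "stát", "xyz"]:
        print(word, split_word(word, SAMPLE_VOCAB))
```

With the full vocabulary the splits will generally differ, since longer pieces that are absent from this subset may match first.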
- Name: config.json
- Size: 589 B
- Format: application/octet-stream
- Description: Unknown
- MD5: acebeb861ad4889b45fc8c2fd204d9e7

- Name: special_tokens_map.json
- Size: 112 B
- Format: application/octet-stream
- Description: Unknown
- MD5: 8b3fb1023167bb4ab9d70708eb05f6ec
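The MD5 values listed for the files in this item can be used to check that downloads are complete and uncorrupted. A minimal sketch using only the Python standard library follows; the directory name in the usage example is a hypothetical placeholder for wherever the files were saved.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the file listing of this item.
EXPECTED_MD5 = {
    "pytorch_model.bin": "a0dff38dfaced0d79713b3bd64c7e17b",
    "tf_model.h5": "464550d4326635eee4903f04ee038cb6",
    "tokenizer_config.json": "12406c6951763e0023a7685628dd3090",
    "README.txt": "a80bf9f333897324aac916d96ea81029",
    "vocab.txt": "b4ed849cc8683e49ded54d2fe92c671e",
    "config.json": "acebeb861ad4889b45fc8c2fd204d9e7",
    "special_tokens_map.json": "8b3fb1023167bb4ab9d70708eb05f6ec",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_directory(directory):
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    # "FERNET-C5" is a hypothetical local download directory.
    for name, ok in verify_directory("FERNET-C5").items():
        print(f"{name}: {'OK' if ok else 'missing or mismatch'}")
```

Reading in chunks keeps memory use low for the large weight files.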


