Files in this item

Name: README.txt
Size: 2.81 KB
Format: Text file
Description: README
MD5: a80bf9f333897324aac916d96ea81029
 File Preview  
The FERNET-C5 is a monolingual BERT language representation model trained
from scratch on the Czech Colossal Clean Crawled Corpus (C5), a Czech
variant of the English C4 dataset [Raffel20]. Only the Czech records
are selected from the Common Crawl archives, and the resulting data are
cleaned using a simple yet rigorous process that removed about 98% of the
downloaded plain text, mainly due to the deduplication rules. The final
cleaned dataset still contains almost 13 billion words (93 GB of text data).

The model has the same architecture as the original BERT model [Devlin19],
i.e. 12 Transformer blocks, 12 attention heads and a hidden size of 768.
In contrast to Google's BERT models, we used SentencePiece tokenization
instead of Google's internal WordPiece tokenization. We trained a BPE
SentencePiece model on the underlying dataset with the vocabulary size set to
100 thousand. Since the dataset contains a small portion of non-Czech text . . .
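The architecture figures in the README preview can be turned into a rough parameter estimate. The sketch below is only illustrative: the layer count, head count and hidden size come from the README, while the exact vocabulary size (taken as 100,000), the feed-forward size (3072), the maximum number of positions (512) and the two segment types are assumed BERT-base defaults that the preview does not state. `BertShape` and `count_parameters` are hypothetical names, and the count covers the encoder and pooler only, not any pre-training heads.

```python
from dataclasses import dataclass


@dataclass
class BertShape:
    # Values stated in the README preview.
    num_layers: int = 12
    num_heads: int = 12            # does not change the parameter count
    hidden_size: int = 768
    # Assumptions: not stated in the preview.
    vocab_size: int = 100_000      # README says "100 thousand"; exact size may differ
    intermediate_size: int = 3072  # BERT-base default
    max_positions: int = 512       # BERT-base default
    type_vocab_size: int = 2       # BERT-base default


def count_parameters(s: BertShape) -> int:
    """Count weights and biases of a BERT encoder plus pooler."""
    h, i = s.hidden_size, s.intermediate_size
    layer_norm = 2 * h  # scale and bias
    # Token, position and segment embeddings followed by a LayerNorm.
    embeddings = (s.vocab_size + s.max_positions + s.type_vocab_size) * h + layer_norm
    # Query, key, value and output projections, then a LayerNorm.
    attention = 4 * (h * h + h) + layer_norm
    # Two dense layers (h -> i -> h), then a LayerNorm.
    feed_forward = (h * i + i) + (i * h + h) + layer_norm
    pooler = h * h + h
    return embeddings + s.num_layers * (attention + feed_forward) + pooler


if __name__ == "__main__":
    total = count_parameters(BertShape())
    print(f"approx. {total / 1e6:.1f}M parameters")
```

Under these assumptions the embedding table accounts for a much larger share of the model than in the original English BERT, because the vocabulary is roughly three times as large.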
                                            
Name: config.json
Size: 589 bytes
Format: Unknown
Description: Model configuration
MD5: acebeb861ad4889b45fc8c2fd204d9e7

Name: pytorch_model.bin
Size: 623.92 MB
Format: Unknown
Description: Model weights for the PyTorch back-end
MD5: a0dff38dfaced0d79713b3bd64c7e17b

Name: tf_model.h5
Size: 916.32 MB
Format: Unknown
Description: Model weights for the TensorFlow back-end
MD5: 464550d4326635eee4903f04ee038cb6

Name: tokenizer_config.json
Size: 239 bytes
Format: Unknown
Description: Tokenizer configuration
MD5: 12406c6951763e0023a7685628dd3090

Name: vocab.txt
Size: 951.39 KB
Format: Text file
Description: Tokenizer vocabulary
MD5: b4ed849cc8683e49ded54d2fe92c671e
 File Preview  
[PAD]
[UNK]
[CLS]
[SEP]
[MASK]
p
s
n
v
,
.
t
d
##st
z
j
##ní
##ov
##ch
a
k
m
##ro
##le
o
po
##li
##ra
##la
##ou
##ce
na
se
##ho
b
##en
ne
##te
##ře
je
##in
##ak
do
##em
##sk
##va
##ně
##ři
##de
##že
##rá
##lo
u
pro
"
##ni
##to
za
##ří
to
##ti
##na
##ci
č
st
##al
##ku
##an
##no
##ru
##ko
##at
vy
##po
by
##jí
h
##ka
##je
pře
##ých
##re
ve
##ná
##vě
##dy
##ší
##it
ko
že
##ky
##ého
ro
##še
kte
##me
##át
##il
##cí
tak
##tě
js
##ar
##vo
pod
ob
od
##né
##mi
##er
zá
i
##lu
##sí
##lá
##ad
##ne
##ck
ná
##mě
při
P
##ová
jed
de
##av
##se
##or
##dě
##mu
f
M
##tí
ale
##ze
S
##sti
dva
ú
mě
kter
mo
ně
##by
##ty
jak
pr
##di
##du
vý
K
##ká
##da
si
##ví
##lí
le
bu
ho
ce
##nu
pří
##tu
##bo
##ji
##mo
##ové
##ly
##vá
před
B
##če
ch
##zi
ti
##ta
##ve
##lé
##ri
##dní
##vní
##ým
##ova
##ské
-
nej
##as
##do
##ma
te
##ny
##c . . .
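The vocab.txt preview lists one token per line, starting with the five special tokens. A minimal sketch of reading such a file follows. It assumes that a token's ID is its zero-based line index and that a leading `##` marks a piece that continues a word, which is the usual convention for BERT-style vocabulary files but is not stated in the listing. `load_vocab` and `is_continuation` are hypothetical helper names.

```python
from typing import Dict, Iterable


def load_vocab(lines: Iterable[str]) -> Dict[str, int]:
    """Map each token to its zero-based line index (assumed to be its ID)."""
    vocab: Dict[str, int] = {}
    for index, line in enumerate(lines):
        token = line.rstrip("\r\n")
        vocab[token] = index
    return vocab


def is_continuation(token: str) -> bool:
    """True for pieces such as '##st' that attach to a preceding piece."""
    return token.startswith("##")


# First entries of the vocab.txt preview, in their original order.
PREVIEW = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
           "p", "s", "n", "v", ",", ".", "t", "d", "##st"]
vocab = load_vocab(PREVIEW)
print(vocab["[MASK]"], is_continuation("##st"))
```

For the full file, the same function can be applied to an open handle, e.g. `with open("vocab.txt", encoding="utf-8") as f: vocab = load_vocab(f)`.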
                                            
Name: special_tokens_map.json
Size: 112 bytes
Format: Unknown
Description: List of special tokens
MD5: 8b3fb1023167bb4ab9d70708eb05f6ec
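Since every file in the listing comes with an MD5 checksum, a download can be checked against it. The following is a minimal sketch using Python's standard `hashlib`; the checksums are copied from the listing above, while the helper names `md5_of` and `verify` and the directory layout (all files saved under one directory with their listed names) are assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing.
EXPECTED_MD5 = {
    "README.txt": "a80bf9f333897324aac916d96ea81029",
    "config.json": "acebeb861ad4889b45fc8c2fd204d9e7",
    "pytorch_model.bin": "a0dff38dfaced0d79713b3bd64c7e17b",
    "tf_model.h5": "464550d4326635eee4903f04ee038cb6",
    "tokenizer_config.json": "12406c6951763e0023a7685628dd3090",
    "vocab.txt": "b4ed849cc8683e49ded54d2fe92c671e",
    "special_tokens_map.json": "8b3fb1023167bb4ab9d70708eb05f6ec",
}


def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large weight files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory):
    """Return {file name: True/False}; missing files count as failures."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of(path) == expected
    return results


if __name__ == "__main__":
    # "." is a placeholder for the directory holding the downloaded files.
    for name, ok in verify(".").items():
        print(f"{name}: {'OK' if ok else 'MISMATCH or missing'}")
```

MD5 is adequate here for detecting corrupted or incomplete downloads; it is not a safeguard against deliberate tampering.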