dc.contributor.author: Lehečka, Jan
dc.contributor.author: Švec, Jan
dc.date.accessioned: 2021-09-20T12:33:51Z
dc.date.available: 2021-09-20T12:33:51Z
dc.date.issued: 2021-09-20
dc.identifier.uri: http://hdl.handle.net/11234/1-3776
dc.description: FERNET-C5 is a monolingual BERT language representation model trained from scratch on the Czech Colossal Clean Crawled Corpus (C5), a Czech counterpart of the English C4 dataset. The training data contained almost 13 billion words (93 GB of text). The model has the same architecture as the original BERT model, i.e. 12 transformer blocks, 12 attention heads and a hidden size of 768. In contrast to Google's BERT models, we used SentencePiece tokenization instead of Google's internal WordPiece tokenization. More details can be found in README.txt; a yet more detailed description is available at https://arxiv.org/abs/2107.10042. The same models are also released at https://huggingface.co/fav-kky/FERNET-C5 and can be loaded from there (see the sketch below).
dc.language.iso: ces
dc.publisher: University of West Bohemia, Department of Cybernetics
dc.rights: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.subject: Czech
dc.subject: BERT
dc.title: FERNET-C5
dc.type: languageDescription
metashare.ResourceInfo#ContentInfo.mediaType: text
metashare.ResourceInfo#ContentInfo.detailedType: mlmodel
dc.rights.label: PUB
has.files: yes
branding: LINDAT / CLARIAH-CZ
contact.person: Pavel Ircing, ircing@kky.zcu.cz, University of West Bohemia
files.size: 1616041874 bytes
files.count: 7
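
The record notes that the same model is published at https://huggingface.co/fav-kky/FERNET-C5. A minimal loading sketch using the Hugging Face transformers library (the model id comes from this record; the AutoModel/AutoTokenizer interface is standard transformers usage, not part of this record):

    # Load FERNET-C5 from the Hugging Face mirror listed in dc.description.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("fav-kky/FERNET-C5")
    model = AutoModel.from_pretrained("fav-kky/FERNET-C5")

    # Encode a Czech sentence; hidden states have size 768 per the record.
    inputs = tokenizer("Praha je hlavní město České republiky.",
                       return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)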


Files in this item

Name: README.txt
Size: 2.81 KB
Format: Text file
Description: README
MD5: a80bf9f333897324aac916d96ea81029

Preview:
FERNET-C5 is a monolingual BERT language representation model trained
from scratch on the Czech Colossal Clean Crawled Corpus (C5),
a Czech counterpart of the English C4 dataset [Raffel20]. Only the Czech records
were selected from the Common Crawl archives, and the resulting data were
cleaned with a simple yet rigorous process that removed about 98% of the
downloaded plain text, mainly through deduplication rules. The final cleaned
dataset still contains almost 13 billion words (93 GB of text data).

The model has the same architecture as the original BERT model [Devlin19],
i.e. 12 transformer blocks, 12 attention heads and a hidden size of 768.
In contrast to Google's BERT models, we used SentencePiece tokenization
instead of Google's internal WordPiece tokenization. We trained a BPE
SentencePiece model on the underlying dataset with the vocabulary size set to
100 thousand. Since the dataset contains a small portion of non-Czech text . . .
Name: config.json
Size: 589 bytes
Format: Unknown
Description: Model configuration
MD5: acebeb861ad4889b45fc8c2fd204d9e7

Name: pytorch_model.bin
Size: 623.92 MB
Format: Unknown
Description: Model weights for the PyTorch back-end
MD5: a0dff38dfaced0d79713b3bd64c7e17b

Name: tf_model.h5
Size: 916.32 MB
Format: Unknown
Description: Model weights for the TensorFlow back-end
MD5: 464550d4326635eee4903f04ee038cb6

Name: tokenizer_config.json
Size: 239 bytes
Format: Unknown
Description: Tokenizer configuration
MD5: 12406c6951763e0023a7685628dd3090

Name: vocab.txt
Size: 951.39 KB
Format: Text file
Description: Tokenizer vocabulary
MD5: b4ed849cc8683e49ded54d2fe92c671e

Preview:
[PAD]
[UNK]
[CLS]
[SEP]
[MASK]
p
s
n
v
,
.
t
d
##st
z
j
##ní
##ov
##ch
a
k
m
##ro
##le
o
po
##li
##ra
##la
##ou
##ce
na
se
##ho
b
##en
ne
##te
##ře
je
##in
##ak
do
##em
##sk
##va
##ně
##ři
##de
##že
##rá
##lo
u
pro
"
##ni
##to
za
##ří
to
##ti
##na
##ci
č
st
##al
##ku
##an
##no
##ru
##ko
##at
vy
##po
by
##jí
h
##ka
##je
pře
##ých
##re
ve
##ná
##vě
##dy
##ší
##it
ko
že
##ky
##ého
ro
##še
kte
##me
##át
##il
##cí
tak
##tě
js
##ar
##vo
pod
ob
od
##né
##mi
##er
zá
i
##lu
##sí
##lá
##ad
##ne
##ck
ná
##mě
při
P
##ová
jed
de
##av
##se
##or
##dě
##mu
f
M
##tí
ale
##ze
S
##sti
dva
ú
mě
kter
mo
ně
##by
##ty
jak
pr
##di
##du
vý
K
##ká
##da
si
##ví
##lí
le
bu
ho
ce
##nu
pří
##tu
##bo
##ji
##mo
##ové
##ly
##vá
před
B
##če
ch
##zi
ti
##ta
##ve
##lé
##ri
##dní
##vní
##ým
##ova
##ské
-
nej
##as
##do
##ma
te
##ny
##c . . .
Name: special_tokens_map.json
Size: 112 bytes
Format: Unknown
Description: List of special tokens
MD5: 8b3fb1023167bb4ab9d70708eb05f6ec
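
The README excerpt above describes a 100k BPE SentencePiece vocabulary, and the vocab.txt preview shows its pieces, where a "##" prefix marks a continuation of the preceding piece rather than the start of a new word. A small sketch for inspecting this segmentation (the printed pieces are illustrative, not actual model output):

    # Tokenize Czech text and show the subword pieces; "##"-prefixed
    # pieces continue the preceding piece, as in the vocab.txt preview.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("fav-kky/FERNET-C5")
    tokens = tokenizer.tokenize("nejnovější zprávy")
    print(tokens)  # e.g. ['nej', '##no', ...] -- illustrative only
    print(tokenizer.convert_tokens_to_ids(tokens))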
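
Each download can be verified against the MD5 digest listed in its entry; a minimal checking sketch (the expected digest below is copied from the pytorch_model.bin entry above):

    # Verify a downloaded file against the MD5 published in this record.
    import hashlib

    def md5sum(path, chunk_size=1 << 20):
        """Return the MD5 hex digest of a file, read in 1 MiB chunks."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    assert md5sum("pytorch_model.bin") == "a0dff38dfaced0d79713b3bd64c7e17b"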
