FERNET-C5

Name: FERNET-C5
License: http://creativecommons.org/licenses/by-nc-sa/4.0/

Authors: Lehečka, Jan and Švec, Jan

Item identifier: http://hdl.handle.net/11234/1-3776

Date issued: 2021-09-20

Type: languageDescription, text

Language(s): Czech

Description: FERNET-C5 is a monolingual BERT language representation model trained from scratch on the Czech Colossal Clean Crawled Corpus (C5) data - a Czech variant of the English C4 dataset. The training data contained almost 13 billion words (93 GB of text data). The model has the same architecture as the original BERT model, i.e. 12 transformer blocks, 12 attention heads and a hidden size of 768. In contrast to Google’s BERT models, we used SentencePiece tokenization instead of Google’s internal WordPiece tokenization. More details can be found in README.txt. A more detailed description is available at https://arxiv.org/abs/2107.10042. The same models are also released at https://huggingface.co/fav-kky/FERNET-C5
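
The Hugging Face release mentioned above can be queried for masked-word predictions. The following is a minimal sketch, not the authors' documented usage: it assumes the `transformers` package and a PyTorch back-end are installed, that the hub identifier `fav-kky/FERNET-C5` (taken from the URL above) resolves, and that the mask token is `[MASK]` as listed in vocab.txt. The helper name `fill_mask` and the example sentence are illustrative only.

```python
MODEL_ID = "fav-kky/FERNET-C5"  # hub identifier taken from the URL in the description


def fill_mask(text, top_k=5):
    """Return (token, score) pairs for the [MASK] position in `text`.

    Assumption: `text` contains exactly one [MASK] token. The model is
    downloaded from the Hugging Face hub on first use.
    """
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=MODEL_ID)
    predictions = unmasker(text, top_k=top_k)
    return [(p["token_str"], p["score"]) for p in predictions]


# Example call (downloads the model weights, so it is left commented out):
# fill_mask("Praha je hlavní [MASK] České republiky.")
```

Note that the license above is non-commercial (CC BY-NC-SA 4.0), which also applies to any use of the downloaded weights.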

Publisher: University of West Bohemia, Department of Cybernetics

Subject(s): Czech BERT

Collection(s): LINDAT / CLARIAH-CZ Data & Tools

Files in this item

This item is publicly available and licensed under:
Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

Name: README.txt
Size: 2.81 KB
Format: Text file
Description: README
MD5: a80bf9f333897324aac916d96ea81029

File Preview

FERNET-C5 is a monolingual BERT language representation model trained
from scratch on the Czech Colossal Clean Crawled Corpus (C5) data -
a Czech variant of the English C4 dataset [Raffel20]. Only the Czech records
were selected from the Common Crawl archives, and the resulting data were cleaned
using a simple yet rigorous cleaning process which removed about 98% of the
downloaded plain texts, mainly due to the deduplication rules. The final cleaned
dataset still contains almost 13 billion words (93 GB of text data).

The model has the same architecture as the original BERT model [Devlin19],
i.e. 12 transformer blocks, 12 attention heads and a hidden size of 768.
In contrast to Google’s BERT models, we used SentencePiece tokenization
instead of Google’s internal WordPiece tokenization. We trained a BPE
SentencePiece model from the underlying dataset with the vocabulary size set to
100 thousand. Since the dataset contains a small portion of non-Czech text . . .
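
To make the stated architecture concrete, the sketch below derives the per-head dimension and a rough encoder parameter count from the figures in the README (12 blocks, 12 heads, hidden size 768, vocabulary of about 100 thousand). Several values are assumptions not given on this page: a feed-forward size of 3072, 512 positions and 2 segment types are the usual BERT-base defaults, and the vocabulary is taken as exactly 100,000. The class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class BertShape:
    vocab_size: int = 100_000   # README: about 100 thousand (exact value assumed)
    hidden: int = 768           # README
    layers: int = 12            # README
    heads: int = 12             # README
    intermediate: int = 3072    # assumption: BERT-base default
    max_positions: int = 512    # assumption: BERT-base default
    type_vocab: int = 2         # assumption: BERT-base default


def head_dim(s: BertShape) -> int:
    """Width of a single attention head."""
    return s.hidden // s.heads


def embedding_params(s: BertShape) -> int:
    """Token, position and segment embeddings plus one LayerNorm (scale and bias)."""
    tables = (s.vocab_size + s.max_positions + s.type_vocab) * s.hidden
    return tables + 2 * s.hidden


def layer_params(s: BertShape) -> int:
    """One transformer block: Q, K, V and output projections, feed-forward, two LayerNorms."""
    attention = 4 * (s.hidden * s.hidden + s.hidden)
    feed_forward = (s.hidden * s.intermediate + s.intermediate
                    + s.intermediate * s.hidden + s.hidden)
    norms = 2 * 2 * s.hidden
    return attention + feed_forward + norms


def encoder_params(s: BertShape) -> int:
    """Embeddings plus all blocks; the pooler and the masked-LM head are excluded."""
    return embedding_params(s) + s.layers * layer_params(s)


shape = BertShape()
print(head_dim(shape), encoder_params(shape))
```

Under these assumptions the token embedding table alone accounts for a large share of the parameters, which is the main practical effect of the 100-thousand-entry vocabulary compared with smaller BERT vocabularies.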

Name: config.json
Size: 589 bytes
Format: Unknown
Description: Model configuration
MD5: acebeb861ad4889b45fc8c2fd204d9e7

Name: pytorch_model.bin
Size: 623.92 MB
Format: Unknown
Description: Model weights for pytorch back-end
MD5: a0dff38dfaced0d79713b3bd64c7e17b

Name: tf_model.h5
Size: 916.32 MB
Format: Unknown
Description: Model weights for tensorflow back-end
MD5: 464550d4326635eee4903f04ee038cb6

Name: tokenizer_config.json
Size: 239 bytes
Format: Unknown
Description: Tokenizer configuration
MD5: 12406c6951763e0023a7685628dd3090

Name: vocab.txt
Size: 951.39 KB
Format: Text file
Description: Tokenizer vocabulary
MD5: b4ed849cc8683e49ded54d2fe92c671e

File Preview

[PAD]
[UNK]
[CLS]
[SEP]
[MASK]
p
s
n
v
,
.
t
d
##st
z
j
##ní
##ov
##ch
a
k
m
##ro
##le
o
po
##li
##ra
##la
##ou
##ce
na
se
##ho
b
##en
ne
##te
##ře
je
##in
##ak
do
##em
##sk
##va
##ně
##ři
##de
##že
##rá
##lo
u
pro
"
##ni
##to
za
##ří
to
##ti
##na
##ci
č
st
##al
##ku
##an
##no
##ru
##ko
##at
vy
##po
by
##jí
h
##ka
##je
pře
##ých
##re
ve
##ná
##vě
##dy
##ší
##it
ko
že
##ky
##ého
ro
##še
kte
##me
##át
##il
##cí
tak
##tě
js
##ar
##vo
pod
ob
od
##né
##mi
##er
zá
i
##lu
##sí
##lá
##ad
##ne
##ck
ná
##mě
při
P
##ová
jed
de
##av
##se
##or
##dě
##mu
f
M
##tí
ale
##ze
S
##sti
dva
ú
mě
kter
mo
ně
##by
##ty
jak
pr
##di
##du
vý
K
##ká
##da
si
##ví
##lí
le
bu
ho
ce
##nu
pří
##tu
##bo
##ji
##mo
##ové
##ly
##vá
před
B
##če
ch
##zi
ti
##ta
##ve
##lé
##ri
##dní
##vní
##ým
##ova
##ské
-
nej
##as
##do
##ma
te
##ny
##c . . .
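
The preview shows that vocab.txt uses the BERT-style layout: one token per line, special tokens first, and a `##` prefix on pieces that continue a word. The sketch below illustrates how such a file is typically read and applied. It assumes the common convention that a token's id is its zero-based line number, and it uses a greedy longest-match-first split over a toy set of pieces copied from the preview. It is not the released tokenizer, and the real vocabulary would likely split these words differently.

```python
# First lines of the preview, in file order (the real file is far longer).
PREVIEW_HEAD = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "p", "s", "n", "v"]

# A toy subset of pieces that appear in the preview, for demonstration only.
TOY_PIECES = {"p", "po", "pr", "pro", "na", "se", "##to", "##ší", "##ní", "##ho"}


def load_vocab(lines):
    """Map token -> id, assuming the id is the zero-based line number."""
    return {line.rstrip("\n"): index for index, line in enumerate(lines)}


def wordpiece(word, pieces, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into vocabulary pieces."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in pieces:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word maps to the unknown token
        tokens.append(match)
        start = end
    return tokens


def join_pieces(tokens):
    """Undo the split for one word by dropping the '##' prefixes."""
    return "".join(t[2:] if t.startswith("##") else t for t in tokens)


vocab = load_vocab(PREVIEW_HEAD)
print(vocab["[MASK]"], wordpiece("proto", TOY_PIECES), wordpiece("naší", TOY_PIECES))
```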

Name: special_tokens_map.json
Size: 112 bytes
Format: Unknown
Description: List of special tokens
MD5: 8b3fb1023167bb4ab9d70708eb05f6ec

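
Each file in this item is listed with an MD5 checksum, so a download can be checked for corruption. The following is a minimal sketch using only the Python standard library. The checksums are copied from the file records above, while the function names and the assumption that all files sit in one local directory are illustrative.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file records of this item.
EXPECTED_MD5 = {
    "README.txt": "a80bf9f333897324aac916d96ea81029",
    "config.json": "acebeb861ad4889b45fc8c2fd204d9e7",
    "pytorch_model.bin": "a0dff38dfaced0d79713b3bd64c7e17b",
    "tf_model.h5": "464550d4326635eee4903f04ee038cb6",
    "tokenizer_config.json": "12406c6951763e0023a7685628dd3090",
    "vocab.txt": "b4ed849cc8683e49ded54d2fe92c671e",
    "special_tokens_map.json": "8b3fb1023167bb4ab9d70708eb05f6ec",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory):
    """Return {file name: True/False}; missing or mismatching files map to False."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of(path) == expected
    return results


# Example call, assuming the files were saved to ./FERNET-C5 (hypothetical path):
# print(verify("FERNET-C5"))
```

Reading in chunks keeps memory use low for the larger weight files. MD5 is adequate for detecting accidental corruption but is not a defence against deliberate tampering.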