FERNET-C5

Name: FERNET-C5
License: http://creativecommons.org/licenses/by-nc-sa/4.0/

Lehečka, Jan; Švec, Jan

LINDAT / CLARIAH-CZ

Authors: Lehečka, Jan and Švec, Jan

Identifier: http://hdl.handle.net/11234/1-3776

Date issued: 2021-09-20

Type: languageDescription, text

Languages: Czech

Description: FERNET-C5 is a monolingual BERT language representation model trained from scratch on the Czech Colossal Clean Crawled Corpus (C5), a Czech variant of the English C4 dataset. The training data contained almost 13 billion words (93 GB of text). The model has the same architecture as the original BERT model, i.e. 12 Transformer blocks, 12 attention heads and a hidden size of 768. In contrast to Google’s BERT models, we used SentencePiece tokenization instead of Google’s internal WordPiece tokenization. More details can be found in README.txt. A more detailed description is available at https://arxiv.org/abs/2107.10042, and the same models are also released at https://huggingface.co/fav-kky/FERNET-C5
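The architecture figures in the description (12 blocks, hidden size 768) together with the vocabulary of 100 thousand entries mentioned in the README preview are enough for a rough parameter count. The sketch below is only an estimate: the feed-forward width of 3072, the 512 positions and the 2 token types are the usual BERT-base defaults and are assumptions here, not values stated in this record, and the vocabulary is assumed to be exactly 100,000 entries.

```python
def bert_parameter_count(vocab_size, hidden, layers, intermediate,
                         max_positions, type_vocab=2):
    """Estimate the number of weights in a BERT encoder with a pooler."""
    layer_norm = 2 * hidden  # scale and bias
    # Word, position and token-type embeddings plus one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections (weights and biases).
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers of the feed-forward sub-block.
    feed_forward = (hidden * intermediate + intermediate) \
        + (intermediate * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler


# From the record: 12 blocks, hidden size 768, about 100 thousand tokens.
# Assumed BERT-base defaults: intermediate=3072, max_positions=512.
total = bert_parameter_count(vocab_size=100_000, hidden=768, layers=12,
                             intermediate=3072, max_positions=512)
head_dim = 768 // 12  # width of each of the 12 attention heads
print(f"{total:,} parameters, {head_dim} dimensions per head")
```

Under these assumptions the estimate is about 163 million parameters, of which roughly 77 million sit in the embedding table because of the large vocabulary. The number of attention heads does not change the count; it only splits the 768-dimensional hidden state into 12 slices.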

Publisher: University of West Bohemia, Department of Cybernetics

Keywords: Czech BERT

Collection: LINDAT / CLARIAH-CZ Data & Tools

Files in this item

License category:

Publicly Available

License: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

Name: README.txt
Size: 2.81 KB
Format: Text file
Description: README
MD5: a80bf9f333897324aac916d96ea81029

File preview

FERNET-C5 is a monolingual BERT language representation model trained
from scratch on the Czech Colossal Clean Crawled Corpus (C5),
a Czech variant of the English C4 dataset [Raffel20]. Only the Czech records
were selected from the Common Crawl archives, and the resulting data were cleaned using a simple
yet rigorous cleaning process that removed about 98% of the downloaded
plain text, mainly due to the deduplication rules. The final cleaned dataset
still contains almost 13 billion words (93 GB of text data).

The model has the same architecture as the original BERT model [Devlin19],
i.e. 12 Transformer blocks, 12 attention heads and a hidden size of 768.
In contrast to Google’s BERT models, we used SentencePiece tokenization
instead of Google’s internal WordPiece tokenization. We trained a BPE
SentencePiece model from the underlying dataset with the vocabulary size set to
100 thousand. Since the dataset contains a small portion of non-Czech text . . .

Name: config.json
Size: 589 bytes
Format: Unknown
Description: Model configuration
MD5: acebeb861ad4889b45fc8c2fd204d9e7

Name: pytorch_model.bin
Size: 623.92 MB
Format: Unknown
Description: Model weights for the PyTorch back-end
MD5: a0dff38dfaced0d79713b3bd64c7e17b

Name: tf_model.h5
Size: 916.32 MB
Format: Unknown
Description: Model weights for the TensorFlow back-end
MD5: 464550d4326635eee4903f04ee038cb6

Name: tokenizer_config.json
Size: 239 bytes
Format: Unknown
Description: Tokenizer configuration
MD5: 12406c6951763e0023a7685628dd3090

Name: vocab.txt
Size: 951.39 KB
Format: Text file
Description: Tokenizer vocabulary
MD5: b4ed849cc8683e49ded54d2fe92c671e

File preview

[PAD]
[UNK]
[CLS]
[SEP]
[MASK]
p
s
n
v
,
.
t
d
##st
z
j
##ní
##ov
##ch
a
k
m
##ro
##le
o
po
##li
##ra
##la
##ou
##ce
na
se
##ho
b
##en
ne
##te
##ře
je
##in
##ak
do
##em
##sk
##va
##ně
##ři
##de
##že
##rá
##lo
u
pro
"
##ni
##to
za
##ří
to
##ti
##na
##ci
č
st
##al
##ku
##an
##no
##ru
##ko
##at
vy
##po
by
##jí
h
##ka
##je
pře
##ých
##re
ve
##ná
##vě
##dy
##ší
##it
ko
že
##ky
##ého
ro
##še
kte
##me
##át
##il
##cí
tak
##tě
js
##ar
##vo
pod
ob
od
##né
##mi
##er
zá
i
##lu
##sí
##lá
##ad
##ne
##ck
ná
##mě
při
P
##ová
jed
de
##av
##se
##or
##dě
##mu
f
M
##tí
ale
##ze
S
##sti
dva
ú
mě
kter
mo
ně
##by
##ty
jak
pr
##di
##du
vý
K
##ká
##da
si
##ví
##lí
le
bu
ho
ce
##nu
pří
##tu
##bo
##ji
##mo
##ové
##ly
##vá
před
B
##če
ch
##zi
ti
##ta
##ve
##lé
##ri
##dní
##vní
##ým
##ova
##ské
-
nej
##as
##do
##ma
te
##ny
##c . . .
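
The preview above uses the BERT vocabulary-file convention in which a `##` prefix marks a piece that continues a word. As an illustration of how such a file is typically consumed, here is a minimal sketch of the greedy longest-match-first lookup used by standard BERT tokenizers. It is an assumption that the released tokenizer segments text this way; the record only states that the vocabulary was trained as a BPE SentencePiece model. `SAMPLE_VOCAB` is a hand-picked subset of the entries shown in the preview, and the `wordpiece` helper is hypothetical.

```python
# A few entries copied from the vocab.txt preview (not the full vocabulary).
SAMPLE_VOCAB = {"[UNK]", "pro", "po", "jak", "na", "##to", "##že", "##ní", "##st"}


def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into vocabulary pieces by greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word maps to the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


print(wordpiece("protože", SAMPLE_VOCAB))
print(wordpiece("jak", SAMPLE_VOCAB))
```

With this subset, "protože" is split into the word-initial piece `pro` followed by the continuation pieces `##to` and `##že`, while a word with no matching pieces falls back to `[UNK]`. With the full vocabulary the segmentation of a given word may differ, since longer pieces may be available.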

Name: special_tokens_map.json
Size: 112 bytes
Format: Unknown
Description: List of special tokens
MD5: 8b3fb1023167bb4ab9d70708eb05f6ec
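
The MD5 values listed for each file can be used to check a download for corruption. The following is a minimal sketch using only the Python standard library; the checksums are copied from the listing above, while the function names and the idea of keeping all files in one local directory are assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "README.txt": "a80bf9f333897324aac916d96ea81029",
    "config.json": "acebeb861ad4889b45fc8c2fd204d9e7",
    "pytorch_model.bin": "a0dff38dfaced0d79713b3bd64c7e17b",
    "tf_model.h5": "464550d4326635eee4903f04ee038cb6",
    "tokenizer_config.json": "12406c6951763e0023a7685628dd3090",
    "vocab.txt": "b4ed849cc8683e49ded54d2fe92c671e",
    "special_tokens_map.json": "8b3fb1023167bb4ab9d70708eb05f6ec",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large weight files are not read at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_directory(directory):
    """Return {filename: 'ok' | 'mismatch' | 'missing'} for the listed files."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == expected:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

Calling `verify_directory` on the folder that holds the downloaded files (the folder name is up to the user) returns one status per file. MD5 is adequate for detecting transfer errors, but it is not a safeguard against deliberate tampering.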
