
 
dc.contributor.author Gutiérrez-Fandiño, Asier
dc.contributor.author Pérez-Fernández, David
dc.contributor.author Armengol-Estapé, Jordi
dc.contributor.author Griol, David
dc.contributor.author Callejas, Zoraida
dc.date.accessioned 2022-08-03T13:32:00Z
dc.date.available 2022-08-03T13:32:00Z
dc.date.issued 2022-07-01
dc.identifier.other http://hdl.handle.net/11234/1-4807
dc.identifier.uri http://hdl.handle.net/11372/LRT-4807
dc.description In recent years, Transformer-based models have led to significant advances in language modelling for natural language processing. However, they require a vast amount of data to be (pre-)trained, and there is a lack of corpora in languages other than English. Recently, several initiatives have presented multilingual datasets obtained from automatic web crawling. However, the results for Spanish have important shortcomings: they are either too small in comparison with other languages or of low quality because of sub-optimal cleaning and deduplication. In this paper, we introduce esCorpius, a Spanish crawling corpus obtained from nearly 1 PB of Common Crawl data. It is the most extensive corpus in Spanish with this level of quality in the extraction, purification and deduplication of web textual content. Our data curation process involves a novel, highly parallel cleaning pipeline and encompasses a series of deduplication mechanisms that together ensure the integrity of both document and paragraph boundaries. Additionally, we maintain both the source web page URL and the WARC shard origin URL in order to comply with EU regulations. esCorpius has been released under the CC BY-NC-ND 4.0 license.
dc.language.iso spa
dc.publisher LHF Labs
dc.relation.isreferencedby https://arxiv.org/pdf/2206.15147.pdf
dc.rights Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
dc.rights.uri http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.source.uri https://huggingface.co/datasets/LHF/escorpius
dc.subject spanish crawling corpus
dc.subject crawling corpus
dc.subject spanish corpus
dc.subject massive corpus
dc.subject large corpus
dc.subject clean
dc.subject deduplicated
dc.title esCorpius: A Massive Spanish Crawling Corpus
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
dc.rights.label PUB
has.files yes
branding LRT + Open Submissions
demo.uri https://huggingface.co/datasets/LHF/escorpius
contact.person David Pérez-Fernández david.perez@inv.uam.es Universidad Autónoma de Madrid
contact.person Asier Gutiérrez-Fandiño asier@lhf.ai LHF Labs
size.info 322.5 GB
size.info 2421598201 sentences
size.info 50040055322 tokens
files.size 127717012916
files.count 35


 Files in this item

| Name | Size | Format | Description | MD5 |
|---|---|---|---|---|
| README.md | 3.76 KB | Unknown | Readme | 433934d90e6ab4acb258ca8439e47d73 |
| es_corpus.jsonl.aa.gz | 3.69 GB | application/x-gzip | data chunk | 8dba4ae8bb01221cc5cc502d8c65c1bf |
| es_corpus.jsonl.ab.gz | 3.69 GB | application/x-gzip | data chunk | a7ace259d5b86c495aca9b5934bd370e |
| es_corpus.jsonl.ac.gz | 3.69 GB | application/x-gzip | data chunk | 7c0b06362ad6c45d9a347939f8bbd417 |
| es_corpus.jsonl.ad.gz | 3.68 GB | application/x-gzip | data chunk | 0de1f6fec29626ab9ac484f42cf37005 |
| es_corpus.jsonl.ae.gz | 3.69 GB | application/x-gzip | data chunk | afa65cba79c27b623dbeb01328002c65 |
| es_corpus.jsonl.af.gz | 3.69 GB | application/x-gzip | data chunk | 1745ebf5bfdc7663511391f13a890d27 |
| es_corpus.jsonl.ag.gz | 3.69 GB | application/x-gzip | data chunk | 20727a6f57e8d1ebb6279838e66ba60e |
| es_corpus.jsonl.ah.gz | 3.69 GB | application/x-gzip | data chunk | e27d6e036aadece16471a6be2f57237e |
| es_corpus.jsonl.ai.gz | 3.69 GB | application/x-gzip | data chunk | b62440a395934806dd14961ee16705d2 |
| es_corpus.jsonl.aj.gz | 3.69 GB | application/x-gzip | data chunk | 46a1099e9b10a5c7212501b6786c1be9 |
| es_corpus.jsonl.ak.gz | 3.69 GB | application/x-gzip | data chunk | ea72bade304a260c854b87c0f9fa2a09 |
| es_corpus.jsonl.al.gz | 3.69 GB | application/x-gzip | data chunk | 11ff670ba344e6f4b71041b74b4fc96f |
| es_corpus.jsonl.am.gz | 3.69 GB | application/x-gzip | data chunk | a213850f0f805b02a054dc8f2b660079 |
| es_corpus.jsonl.an.gz | 3.69 GB | application/x-gzip | data chunk | d529a79fcaf8626055372fee959d30bb |
| es_corpus.jsonl.ao.gz | 3.69 GB | application/x-gzip | data chunk | 3d0c5c0f8afdb5e5f4232e609476efb7 |
| es_corpus.jsonl.ap.gz | 3.69 GB | application/x-gzip | data chunk | 8f5d3c5bf55a61f8fdd68893f8720f8a |
| es_corpus.jsonl.aq.gz | 3.69 GB | application/x-gzip | data chunk | 1f752f333933df2de0f0c6c45d510f70 |
| es_corpus.jsonl.ar.gz | 3.69 GB | application/x-gzip | data chunk | f75760bb0587cec3b857908b0df17c0f |
| es_corpus.jsonl.as.gz | 3.69 GB | application/x-gzip | data chunk | 8ea95a1173617514c3cdb34b11c3fe58 |
| es_corpus.jsonl.at.gz | 3.69 GB | application/x-gzip | data chunk | ca1455e2689aa0cb07f743a85cff5abd |
| es_corpus.jsonl.au.gz | 3.69 GB | application/x-gzip | data chunk | 02773af58d4a850360629723abfac7cd |
| es_corpus.jsonl.av.gz | 3.69 GB | application/x-gzip | data chunk | f7e8060d60c1f9900a89f1bb289b96cf |
| es_corpus.jsonl.aw.gz | 3.69 GB | application/x-gzip | data chunk | b76c5612ace2a1d63613c73cd68c8dc5 |
| es_corpus.jsonl.ax.gz | 3.69 GB | application/x-gzip | data chunk | 7e451c06fa09c3f8dbe203f4ded90bdb |
| es_corpus.jsonl.ay.gz | 3.69 GB | application/x-gzip | data chunk | 6bcbd525cff2edc1a412e62bb551fb30 |
| es_corpus.jsonl.az.gz | 3.68 GB | application/x-gzip | data chunk | 6357cea12019987511678891c1d04801 |
| es_corpus.jsonl.ba.gz | 3.69 GB | application/x-gzip | data chunk | 7623254f4f1cf197ff8a3e17bbf0d1cb |
| es_corpus.jsonl.bb.gz | 3.69 GB | application/x-gzip | data chunk | 261a99dc2aaea6b3a12dfe6fbe0b8a8c |
| es_corpus.jsonl.bc.gz | 3.69 GB | application/x-gzip | data chunk | 63302075d9fe23c60bbfeca37ec6d0c2 |
| es_corpus.jsonl.bd.gz | 3.69 GB | application/x-gzip | data chunk | 6fae421f956ef864c7560370ff07c66a |
| es_corpus.jsonl.be.gz | 3.68 GB | application/x-gzip | data chunk | 46f299c0ef99402476f76177dca46b5f |
| es_corpus.jsonl.bf.gz | 3.68 GB | application/x-gzip | data chunk | 5b7ec55cab96f0b604b4cd60490fe7a9 |
| es_corpus.jsonl.bg.gz | 936.87 MB | application/x-gzip | data chunk | cf0c7c680b86994c960728b9c8046c67 |
| escorpius.sha256 | 2.74 KB | Unknown | sha256 sums of the uncompressed data chunks | e3840379cebfbfae96b8a56661465ced |
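
The MD5 values listed for each file can be used to check a download before working with it. The following is a minimal Python sketch of that check, plus a way to peek at the first records of a chunk. It assumes that each `es_corpus.jsonl.*.gz` chunk decompresses to complete JSON Lines records, which this record does not confirm; the helper names `md5_of_file`, `verify_md5` and `iter_jsonl_gz` are hypothetical and not part of any esCorpius tooling.

```python
import gzip
import hashlib
import json


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in blocks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_md5(path, expected_hex):
    """Compare the MD5 of the (still compressed) file with the listed value."""
    return md5_of_file(path) == expected_hex.strip().lower()


def iter_jsonl_gz(path, limit=None):
    """Yield parsed JSON records from a gzip-compressed JSON Lines file.

    Assumption: every non-empty line is one complete JSON value.
    Stops after `limit` records when a limit is given.
    """
    count = 0
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if limit is not None and count >= limit:
                break
            line = line.strip()
            if not line:
                continue
            yield json.loads(line)
            count += 1
```

For example, `verify_md5("es_corpus.jsonl.aa.gz", "8dba4ae8bb01221cc5cc502d8c65c1bf")` should return `True` for an intact download of the first chunk, and `next(iter_jsonl_gz("es_corpus.jsonl.aa.gz", limit=1))` would show one record without decompressing the whole file. Note that `escorpius.sha256` covers the uncompressed chunks, so checking against it would mean hashing the decompressed stream with `hashlib.sha256` instead.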
