esCorpius: A Massive Spanish Crawling Corpus

Please use the following text to cite this item:
Asier, Gutiérrez-Fandiño; David, Pérez-Fernández; Jordi, Armengol-Estapé; David, Griol and Zoraida, Callejas, 2022, esCorpius: A Massive Spanish Crawling Corpus, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11372/LRT-4807.
Date issued
2022-07-01
Size
322.5 GB,
2421598201 sentences,
50040055322 tokens
Language(s)
Spanish
Description
In recent years, Transformer-based models have led to significant advances in language modelling for natural language processing. However, they require a vast amount of data to be (pre-)trained, and there is a lack of corpora in languages other than English. Recently, several initiatives have presented multilingual datasets obtained from automatic web crawling. However, the results in Spanish present important shortcomings, as they are either too small in comparison with other languages or of low quality owing to sub-optimal cleaning and deduplication. In this paper, we introduce esCorpius, a Spanish crawling corpus obtained from nearly 1 PB of Common Crawl data. It is the most extensive corpus in Spanish with this level of quality in the extraction, purification and deduplication of web textual content. Our data curation process involves a novel, highly parallel cleaning pipeline and encompasses a series of deduplication mechanisms that together ensure the integrity of both document and paragraph boundaries. Additionally, we maintain both the source web page URL and the WARC shard origin URL in order to comply with EU regulations. esCorpius has been released under the CC BY-NC-ND 4.0 license.
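The description mentions deduplication that respects document and paragraph boundaries but does not spell out the mechanism. The sketch below is not the authors' pipeline; it only illustrates one simple way such a step could work, namely exact-match paragraph deduplication by hashing. The function name `dedup_paragraphs`, the newline paragraph separator, and the whitespace/case normalisation are all assumptions made for illustration.

```python
import hashlib
from typing import Iterable, Iterator, List


def dedup_paragraphs(documents: Iterable[str]) -> Iterator[str]:
    """Yield documents with previously seen paragraphs dropped.

    Assumption: paragraphs are separated by newlines. This is an
    illustrative convention, not a statement about the corpus format.
    """
    seen = set()  # digests of every paragraph emitted so far
    for doc in documents:
        kept: List[str] = []
        for para in doc.split("\n"):
            # Normalise whitespace and case so trivial variants collide.
            key = " ".join(para.split()).lower()
            if not key:
                continue  # skip empty paragraphs
            digest = hashlib.sha1(key.encode("utf-8")).digest()
            if digest in seen:
                continue  # duplicate paragraph: drop it
            seen.add(digest)
            kept.append(para)
        # Paragraphs are never merged across documents; a document whose
        # paragraphs were all duplicates is dropped entirely.
        if kept:
            yield "\n".join(kept)
```

Storing digests rather than the paragraphs themselves keeps memory use proportional to the number of distinct paragraphs, which matters at this scale; a production pipeline would likely need a distributed or approximate variant.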
Publisher
 Files in this item
Name
README.md
Size
3.76 KB
Format
application/octet-stream
Description
Readme
MD5
433934d90e6ab4acb258ca8439e47d73
Name
es_corpus.jsonl.aa.gz
Size
3.69 GB
Format
application/x-gzip
Description
data chunk
MD5
8dba4ae8bb01221cc5cc502d8c65c1bf
Preview
  File Preview
    • es_corpus.jsonl.aa (9 GB)
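Each file in this listing carries an MD5 checksum, which can be used to confirm that a download completed intact. A minimal sketch follows, assuming the chunk has already been downloaded to a local path; the helper names `md5_of_file` and `matches_listed_md5` are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, block_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size blocks so multi-gigabyte chunks never sit in memory.
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_listed_md5(path: Path, expected_hex: str) -> bool:
    """Compare a file's MD5 with the value shown in the file listing."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

For the first chunk, the call would look like `matches_listed_md5(Path("es_corpus.jsonl.aa.gz"), "8dba4ae8bb01221cc5cc502d8c65c1bf")`, where the local file name is an assumption about where the download was saved.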
Name
es_corpus.jsonl.ab.gz
Size
3.69 GB
Format
application/x-gzip
Description
data chunk
MD5
a7ace259d5b86c495aca9b5934bd370e
Preview
  File Preview
    • es_corpus.jsonl.ab (9 GB)
Name
es_corpus.jsonl.ac.gz
Size
3.69 GB
Format
application/x-gzip
Description
data chunk
MD5
7c0b06362ad6c45d9a347939f8bbd417
Preview
  File Preview
    • es_corpus.jsonl.ac (9 GB)
Name
es_corpus.jsonl.ad.gz
Size
3.68 GB
Format
application/x-gzip
Description
data chunk
MD5
0de1f6fec29626ab9ac484f42cf37005
Preview
  File Preview
    • es_corpus.jsonl.ad (9 GB)
Name
es_corpus.jsonl.ae.gz
Size
3.69 GB
Format
application/x-gzip
Description
data chunk
MD5
afa65cba79c27b623dbeb01328002c65
Preview
  File Preview
    • es_corpus.jsonl.ae (9 GB)
Name
es_corpus.jsonl.af.gz
Size
3.69 GB
Format
application/x-gzip
Description
data chunk
MD5
1745ebf5bfdc7663511391f13a890d27
Preview
  File Preview
    • es_corpus.jsonl.af (9 GB)
Name
es_corpus.jsonl.ag.gz
Size
3.69 GB
Format
application/x-gzip
Description
data chunk
MD5
20727a6f57e8d1ebb6279838e66ba60e
Preview
  File Preview
    • es_corpus.jsonl.ag (9 GB)
Name
es_corpus.jsonl.ah.gz
Size
3.69 GB
Format
application/x-gzip
Description
data chunk
MD5
e27d6e036aadece16471a6be2f57237e
Preview
  File Preview
    • es_corpus.jsonl.ah (9 GB)
Name
es_corpus.jsonl.ai.gz
Size
3.69 GB
Format
application/x-gzip
Description
data chunk
MD5
b62440a395934806dd14961ee16705d2
Preview
  File Preview
    • es_corpus.jsonl.ai (9 GB)
Name
es_corpus.jsonl.aj.gz
Size
3.69 GB
Format
application/x-gzip
Description
data chunk
MD5
46a1099e9b10a5c7212501b6786c1be9
Preview
  File Preview
    • es_corpus.jsonl.aj (9 GB)
Name
es_corpus.jsonl.ak.gz
Size
3.69 GB
Format
application/x-gzip
Description
data chunk
MD5
ea72bade304a260c854b87c0f9fa2a09
Preview
  File Preview
    • es_corpus.jsonl.ak (9 GB)
Name
es_corpus.jsonl.al.gz
Size
3.69 GB
Format
application/x-gzip
Description
data chunk
MD5
11ff670ba344e6f4b71041b74b4fc96f
Preview
  File Preview
    • es_corpus.jsonl.al (9 GB)
Name
es_corpus.jsonl.am.gz
Size
3.69 GB
Format
application/x-gzip
Description
data chunk
MD5
a213850f0f805b02a054dc8f2b660079
Preview
  File Preview
    • es_corpus.jsonl.am (9 GB)
Name
es_corpus.jsonl.an.gz
Size
3.69 GB
Format
application/x-gzip
Description
data chunk
MD5
d529a79fcaf8626055372fee959d30bb
Preview
  File Preview
    • es_corpus.jsonl.an (9 GB)
Name
es_corpus.jsonl.ao.gz
Size
3.69 GB
Format
application/x-gzip
Description
data chunk
MD5
3d0c5c0f8afdb5e5f4232e609476efb7
Preview
  File Preview
    • es_corpus.jsonl.ao (9 GB)
Name
es_corpus.jsonl.ap.gz
Size
3.69 GB
Format
application/x-gzip
Description
data chunk
MD5
8f5d3c5bf55a61f8fdd68893f8720f8a
Preview
  File Preview
    • es_corpus.jsonl.ap (9 GB)
Name
es_corpus.jsonl.aq.gz
Size
3.69 GB
Format
application/x-gzip
Description
data chunk
MD5
1f752f333933df2de0f0c6c45d510f70
Preview
  File Preview
    • es_corpus.jsonl.aq (9 GB)
Name
es_corpus.jsonl.ar.gz
Size
3.69 GB
Format
application/x-gzip
Description
data chunk
MD5
f75760bb0587cec3b857908b0df17c0f
Preview
  File Preview
    • es_corpus.jsonl.ar (9 GB)
Name
es_corpus.jsonl.as.gz
Size
3.69 GB
Format
application/x-gzip
Description
data chunk
MD5
8ea95a1173617514c3cdb34b11c3fe58
Preview
  File Preview
    • es_corpus.jsonl.as (9 GB)
Name
es_corpus.jsonl.at.gz
Size
3.69 GB
Format
application/x-gzip
Description
data chunk
MD5
ca1455e2689aa0cb07f743a85cff5abd
Preview
  File Preview
    • es_corpus.jsonl.at (9 GB)
Name
es_corpus.jsonl.au.gz
Size
3.69 GB
Format
application/x-gzip
Description
data chunk
MD5
02773af58d4a850360629723abfac7cd
Preview
  File Preview
    • es_corpus.jsonl.au (9 GB)
Name
es_corpus.jsonl.av.gz
Size
3.69 GB
Format
application/x-gzip
Description
data chunk
MD5
f7e8060d60c1f9900a89f1bb289b96cf
Preview
  File Preview
    • es_corpus.jsonl.av (9 GB)
Name
es_corpus.jsonl.aw.gz
Size
3.69 GB
Format
application/x-gzip
Description
data chunk
MD5
b76c5612ace2a1d63613c73cd68c8dc5
Preview
  File Preview
    • es_corpus.jsonl.aw (9 GB)
Name
es_corpus.jsonl.ax.gz
Size
3.69 GB
Format
application/x-gzip
Description
data chunk
MD5
7e451c06fa09c3f8dbe203f4ded90bdb
Preview
  File Preview
    • es_corpus.jsonl.ax (9 GB)
Name
es_corpus.jsonl.ay.gz
Size
3.69 GB
Format
application/x-gzip
Description
data chunk
MD5
6bcbd525cff2edc1a412e62bb551fb30
Preview
  File Preview
    • es_corpus.jsonl.ay (9 GB)
Name
es_corpus.jsonl.az.gz
Size
3.68 GB
Format
application/x-gzip
Description
data chunk
MD5
6357cea12019987511678891c1d04801
Preview
  File Preview
    • es_corpus.jsonl.az (9 GB)
Name
es_corpus.jsonl.ba.gz
Size
3.69 GB
Format
application/x-gzip
Description
data chunk
MD5
7623254f4f1cf197ff8a3e17bbf0d1cb
Preview
  File Preview
    • es_corpus.jsonl.ba (9 GB)
Name
es_corpus.jsonl.bb.gz
Size
3.69 GB
Format
application/x-gzip
Description
data chunk
MD5
261a99dc2aaea6b3a12dfe6fbe0b8a8c
Preview
  File Preview
    • es_corpus.jsonl.bb (9 GB)
Name
es_corpus.jsonl.bc.gz
Size
3.69 GB
Format
application/x-gzip
Description
data chunk
MD5
63302075d9fe23c60bbfeca37ec6d0c2
Preview
  File Preview
    • es_corpus.jsonl.bc (9 GB)
Name
es_corpus.jsonl.bd.gz
Size
3.69 GB
Format
application/x-gzip
Description
data chunk
MD5
6fae421f956ef864c7560370ff07c66a
Preview
  File Preview
    • es_corpus.jsonl.bd (9 GB)
Name
es_corpus.jsonl.be.gz
Size
3.68 GB
Format
application/x-gzip
Description
data chunk
MD5
46f299c0ef99402476f76177dca46b5f
Preview
  File Preview
    • es_corpus.jsonl.be (9 GB)
Name
es_corpus.jsonl.bf.gz
Size
3.68 GB
Format
application/x-gzip
Description
data chunk
MD5
5b7ec55cab96f0b604b4cd60490fe7a9
Preview
  File Preview
    • es_corpus.jsonl.bf (9 GB)
Name
es_corpus.jsonl.bg.gz
Size
936.87 MB
Format
application/x-gzip
Description
data chunk
MD5
cf0c7c680b86994c960728b9c8046c67
Preview
  File Preview
    • es_corpus.jsonl.bg (2 GB)
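The data chunks above are gzip-compressed pieces of a JSON Lines file, named `aa` through `bg`. The listing does not say whether each chunk ends on a line boundary, so the sketch below assumes only that the chunks are byte-wise pieces of one file and carries any unfinished line over into the next chunk. The function name `iter_records` is hypothetical, and no record field names are assumed: each line is simply parsed as a JSON value.

```python
import gzip
import json
from typing import Any, Iterable, Iterator


def iter_records(chunk_paths: Iterable[str]) -> Iterator[Any]:
    """Yield one parsed JSON value per line across ordered .gz chunks."""
    pending = b""  # unfinished line carried over from the previous chunk
    for path in chunk_paths:
        with gzip.open(path, "rb") as fh:
            for line in fh:
                if not line.endswith(b"\n"):
                    # The chunk ended mid-line: keep the fragment for later.
                    pending += line
                    continue
                line = pending + line
                pending = b""
                if line.strip():
                    yield json.loads(line)
    # The last chunk may lack a trailing newline.
    if pending.strip():
        yield json.loads(pending)
```

The chunks must be passed in alphabetical order of their suffixes, for example `iter_records(sorted(str(p) for p in Path(".").glob("es_corpus.jsonl.*.gz")))` with `pathlib.Path` imported. Because the function is a generator, records are streamed and the decompressed data is never held in memory as a whole.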
Name
escorpius.sha256
Size
2.74 KB
Format
application/octet-stream
Description
sha256 sums of the uncompressed data chunks
MD5
e3840379cebfbfae96b8a56661465ced
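Since `escorpius.sha256` holds sums of the uncompressed chunks, a check has to hash the decompressed stream rather than the `.gz` file itself. The sketch below assumes the manifest uses the common `sha256sum` layout (`<hex digest>  <file name>` per line); that layout is an assumption and should be confirmed against the actual file. The function names are hypothetical.

```python
import gzip
import hashlib
from typing import Dict


def parse_sha256_manifest(text: str) -> Dict[str, str]:
    """Parse lines of the assumed form '<hex digest>  <file name>'."""
    sums: Dict[str, str] = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            digest, name = parts
            # sha256sum marks binary-mode entries with a leading '*'.
            sums[name.strip().lstrip("*")] = digest.lower()
    return sums


def sha256_of_gzip_payload(path: str, block_size: int = 1 << 20) -> str:
    """Hash the decompressed content of a .gz file without extracting it."""
    digest = hashlib.sha256()
    with gzip.open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

Under these assumptions, a chunk such as `es_corpus.jsonl.aa.gz` would be verified by comparing `sha256_of_gzip_payload("es_corpus.jsonl.aa.gz")` with the manifest entry for `es_corpus.jsonl.aa`.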