dc.contributor.author | Asier, Gutiérrez-Fandiño |
dc.contributor.author | David, Pérez-Fernández |
dc.contributor.author | Jordi, Armengol-Estapé |
dc.contributor.author | David, Griol |
dc.contributor.author | Zoraida, Callejas |
dc.date.accessioned | 2022-08-03T13:32:00Z |
dc.date.available | 2022-08-03T13:32:00Z |
dc.date.issued | 2022-07-01 |
dc.identifier.other | http://hdl.handle.net/11234/1-4807 |
dc.identifier.uri | http://hdl.handle.net/11372/LRT-4807 |
dc.description | In recent years, Transformer-based models have led to significant advances in language modelling for natural language processing. However, they require a vast amount of data to be (pre-)trained, and there is a lack of corpora in languages other than English. Recently, several initiatives have presented multilingual datasets obtained from automatic web crawling. However, the results in Spanish present important shortcomings, as they are either too small in comparison with other languages or of low quality owing to sub-optimal cleaning and deduplication. In this paper, we introduce esCorpius, a Spanish crawling corpus obtained from nearly 1 PB of Common Crawl data. It is the most extensive corpus in Spanish with this level of quality in the extraction, purification and deduplication of web textual content. Our data curation process involves a novel, highly parallel cleaning pipeline and encompasses a series of deduplication mechanisms that together ensure the integrity of both document and paragraph boundaries. Additionally, we maintain both the source web page URL and the WARC shard origin URL in order to comply with EU regulations. esCorpius has been released under the CC BY-NC-ND 4.0 license. |
dc.language.iso | spa |
dc.publisher | LHF Labs |
dc.relation.isreferencedby | https://arxiv.org/pdf/2206.15147.pdf |
dc.rights | Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
dc.source.uri | https://huggingface.co/datasets/LHF/escorpius |
dc.subject | spanish crawling corpus |
dc.subject | crawling corpus |
dc.subject | spanish corpus |
dc.subject | massive corpus |
dc.subject | large corpus |
dc.subject | clean |
dc.subject | deduplicated |
dc.title | esCorpius: A Massive Spanish Crawling Corpus |
dc.type | corpus |
metashare.ResourceInfo#ContentInfo.mediaType | text |
dc.rights.label | PUB |
has.files | yes |
branding | LRT + Open Submissions |
demo.uri | https://huggingface.co/datasets/LHF/escorpius |
contact.person | David Pérez-Fernández david.perez@inv.uam.es Universidad Autónoma de Madrid |
contact.person | Asier Gutiérrez-Fandiño asier@lhf.ai LHF Labs |
size.info | 322.5 GB |
size.info | 2421598201 sentences |
size.info | 50040055322 tokens |
files.size | 127717012916 |
files.count | 35 |
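The size fields above are easier to interpret after a little arithmetic. The following is a minimal sketch, assuming that `files.size` is a byte count covering all downloadable files and that binary units (GiB) are wanted; neither assumption is stated in the record itself. The constant and function names are illustrative only.

```python
# Values copied from the size.info and files.* rows of this record.
TOTAL_BYTES = 127_717_012_916  # files.size (assumed to be bytes)
FILE_COUNT = 35                # files.count
SENTENCES = 2_421_598_201      # size.info, sentences
TOKENS = 50_040_055_322        # size.info, tokens


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to gibibytes (2**30 bytes)."""
    return n_bytes / 2**30


download_gib = to_gib(TOTAL_BYTES)
tokens_per_sentence = TOKENS / SENTENCES

print(f"download size: {download_gib:.1f} GiB across {FILE_COUNT} files")
print(f"average tokens per sentence: {tokens_per_sentence:.1f}")
```

The download size computed this way refers to the files as stored in the repository, so it is not directly comparable with the 322.5 GB corpus size reported in `size.info`.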
Files in this item
This item is Publicly Available and licensed under: Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
Name | Size | Format | Description | MD5 |
README.md | 3.76 KB | Unknown | Readme | 433934d90e6ab4acb258ca8439e47d73 |
es_corpus.jsonl.aa.gz | 3.69 GB | application/x-gzip | data chunk | 8dba4ae8bb01221cc5cc502d8c65c1bf |
es_corpus.jsonl.ab.gz | 3.69 GB | application/x-gzip | data chunk | a7ace259d5b86c495aca9b5934bd370e |
es_corpus.jsonl.ac.gz | 3.69 GB | application/x-gzip | data chunk | 7c0b06362ad6c45d9a347939f8bbd417 |
es_corpus.jsonl.ad.gz | 3.68 GB | application/x-gzip | data chunk | 0de1f6fec29626ab9ac484f42cf37005 |
es_corpus.jsonl.ae.gz | 3.69 GB | application/x-gzip | data chunk | afa65cba79c27b623dbeb01328002c65 |
es_corpus.jsonl.af.gz | 3.69 GB | application/x-gzip | data chunk | 1745ebf5bfdc7663511391f13a890d27 |
es_corpus.jsonl.ag.gz | 3.69 GB | application/x-gzip | data chunk | 20727a6f57e8d1ebb6279838e66ba60e |
es_corpus.jsonl.ah.gz | 3.69 GB | application/x-gzip | data chunk | e27d6e036aadece16471a6be2f57237e |
es_corpus.jsonl.ai.gz | 3.69 GB | application/x-gzip | data chunk | b62440a395934806dd14961ee16705d2 |
es_corpus.jsonl.aj.gz | 3.69 GB | application/x-gzip | data chunk | 46a1099e9b10a5c7212501b6786c1be9 |
es_corpus.jsonl.ak.gz | 3.69 GB | application/x-gzip | data chunk | ea72bade304a260c854b87c0f9fa2a09 |
es_corpus.jsonl.al.gz | 3.69 GB | application/x-gzip | data chunk | 11ff670ba344e6f4b71041b74b4fc96f |
es_corpus.jsonl.am.gz | 3.69 GB | application/x-gzip | data chunk | a213850f0f805b02a054dc8f2b660079 |
es_corpus.jsonl.an.gz | 3.69 GB | application/x-gzip | data chunk | d529a79fcaf8626055372fee959d30bb |
es_corpus.jsonl.ao.gz | 3.69 GB | application/x-gzip | data chunk | 3d0c5c0f8afdb5e5f4232e609476efb7 |
es_corpus.jsonl.ap.gz | 3.69 GB | application/x-gzip | data chunk | 8f5d3c5bf55a61f8fdd68893f8720f8a |
es_corpus.jsonl.aq.gz | 3.69 GB | application/x-gzip | data chunk | 1f752f333933df2de0f0c6c45d510f70 |
es_corpus.jsonl.ar.gz | 3.69 GB | application/x-gzip | data chunk | f75760bb0587cec3b857908b0df17c0f |
es_corpus.jsonl.as.gz | 3.69 GB | application/x-gzip | data chunk | 8ea95a1173617514c3cdb34b11c3fe58 |
es_corpus.jsonl.at.gz | 3.69 GB | application/x-gzip | data chunk | ca1455e2689aa0cb07f743a85cff5abd |
es_corpus.jsonl.au.gz | 3.69 GB | application/x-gzip | data chunk | 02773af58d4a850360629723abfac7cd |
es_corpus.jsonl.av.gz | 3.69 GB | application/x-gzip | data chunk | f7e8060d60c1f9900a89f1bb289b96cf |
es_corpus.jsonl.aw.gz | 3.69 GB | application/x-gzip | data chunk | b76c5612ace2a1d63613c73cd68c8dc5 |
es_corpus.jsonl.ax.gz | 3.69 GB | application/x-gzip | data chunk | 7e451c06fa09c3f8dbe203f4ded90bdb |
es_corpus.jsonl.ay.gz | 3.69 GB | application/x-gzip | data chunk | 6bcbd525cff2edc1a412e62bb551fb30 |
es_corpus.jsonl.az.gz | 3.68 GB | application/x-gzip | data chunk | 6357cea12019987511678891c1d04801 |
es_corpus.jsonl.ba.gz | 3.69 GB | application/x-gzip | data chunk | 7623254f4f1cf197ff8a3e17bbf0d1cb |
es_corpus.jsonl.bb.gz | 3.69 GB | application/x-gzip | data chunk | 261a99dc2aaea6b3a12dfe6fbe0b8a8c |
es_corpus.jsonl.bc.gz | 3.69 GB | application/x-gzip | data chunk | 63302075d9fe23c60bbfeca37ec6d0c2 |
es_corpus.jsonl.bd.gz | 3.69 GB | application/x-gzip | data chunk | 6fae421f956ef864c7560370ff07c66a |
es_corpus.jsonl.be.gz | 3.68 GB | application/x-gzip | data chunk | 46f299c0ef99402476f76177dca46b5f |
es_corpus.jsonl.bf.gz | 3.68 GB | application/x-gzip | data chunk | 5b7ec55cab96f0b604b4cd60490fe7a9 |
es_corpus.jsonl.bg.gz | 936.87 MB | application/x-gzip | data chunk | cf0c7c680b86994c960728b9c8046c67 |
escorpius.sha256 | 2.74 KB | Unknown | sha256 sums of the uncompressed data chunks | e3840379cebfbfae96b8a56661465ced |
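The MD5 values listed for each file can be used to check a download before reading it. The following is a minimal sketch, assuming that the listed MD5 values refer to the compressed files as stored, and that each line of a decompressed chunk is one JSON object (suggested by the `.jsonl` extension, but not documented in this record). The function names and the `EXPECTED_MD5` mapping are hypothetical helpers, not part of any esCorpius tooling, and the record fields are deliberately left uninterpreted.

```python
import gzip
import hashlib
import json
from pathlib import Path
from typing import Iterator

# MD5 values copied from the file listing; extend with the other chunks as needed.
EXPECTED_MD5 = {
    "es_corpus.jsonl.aa.gz": "8dba4ae8bb01221cc5cc502d8c65c1bf",
    "es_corpus.jsonl.bg.gz": "cf0c7c680b86994c960728b9c8046c67",
}


def md5_of_file(path: Path, block_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in fixed-size blocks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify(path: Path) -> bool:
    """Compare a downloaded file's MD5 with the value listed for its name."""
    return md5_of_file(path) == EXPECTED_MD5[path.name]


def iter_records(path: Path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a gzipped JSONL file.

    Assumes one JSON object per line; the chunk is streamed, not loaded whole.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)
```

A typical use would be to call `verify(Path("es_corpus.jsonl.aa.gz"))` after downloading and, if it returns `True`, to loop over `iter_records(...)` to inspect the first few records. Reading in blocks and streaming line by line keeps memory use small even for multi-gigabyte chunks. The `escorpius.sha256` file covers the uncompressed chunks, so it would need a separate check after decompression.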