dc.contributor.author Koehn, Philipp
dc.contributor.author Heafield, Kenneth
dc.contributor.author Forcada, Mikel L.
dc.contributor.author Esplà-Gomis, Miquel
dc.contributor.author Ortiz-Rojas, Sergio
dc.contributor.author Ramírez-Sánchez, Gema
dc.contributor.author Sánchez-Cartagena, Víctor M.
dc.contributor.author Haddow, Barry
dc.contributor.author Bañón, Marta
dc.contributor.author Střelec, Marek
dc.contributor.author Samiotou, Anna
dc.contributor.author Kamran, Amir
dc.date.accessioned 2018-02-12T07:41:46Z
dc.date.available 2018-02-12T07:41:46Z
dc.date.issued 2018-01-14
dc.identifier.uri http://hdl.handle.net/11372/LRT-2610
dc.description The January 2018 release of ParaCrawl is the first version of the corpus. It contains parallel corpora for 11 languages paired with English, crawled from a large number of websites. The selection of websites is based on CommonCrawl, but ParaCrawl is extracted from a brand-new crawl with much higher coverage of these selected websites than CommonCrawl. Since the data is fairly raw, it is released with two quality metrics that can be used for corpus filtering. An official "clean" version of each corpus uses one of the metrics. For more details and to download the raw data, please visit: http://paracrawl.eu/releases.html
dc.language.iso eng
dc.language.iso deu
dc.language.iso fra
dc.language.iso spa
dc.language.iso ita
dc.language.iso por
dc.language.iso nld
dc.language.iso pol
dc.language.iso ces
dc.language.iso ron
dc.language.iso fin
dc.language.iso lav
dc.language.iso rus
dc.language.iso est
dc.publisher ParaCrawl
dc.rights Public Domain Dedication (CC Zero)
dc.rights.uri http://creativecommons.org/publicdomain/zero/1.0/
dc.source.uri http://paracrawl.eu
dc.subject ParaCrawl
dc.subject parallel corpus
dc.subject CommonCrawl
dc.subject machine translation
dc.subject text corpora
dc.title ParaCrawl Corpus version 1.0
dc.type corpus
metashare.ResourceInfo#ContentInfo.mediaType text
dc.rights.label PUB
has.files yes
branding LRT + Open Submissions
contact.person Amir Kamran amir@taus.net TAUS
sponsor European Union CEF-TC-2016-3 / Action No: 2016-EU-IA-0114 Connecting Europe Facility (CEF) euFunds
files.size 7857038670
files.count 13


 Files in this item

This item is Publicly Available and licensed under: Public Domain Dedication (CC Zero)

Name
paracrawl-release1.en-de.zipporah0-dedup-clean.tgz
Size
1.79 GB
Format
application/x-gzip
Description
English / German
MD5
30e67e94d111ea675c0567e1c1aa338c
File Preview
    • paracrawl-release1.en-de.zipporah0-dedup-clean.de (3 GB)
    • paracrawl-release1.en-de.zipporah0-dedup-clean.en (2 GB)

Name
paracrawl-release1.en-fr.zipporah0-dedup-clean.tgz
Size
2.1 GB
Format
application/x-gzip
Description
English / French
MD5
89dc66bac5125a3d7aeece01809402b7
File Preview
    • paracrawl-release1.en-fr.zipporah0-dedup-clean.fr (3 GB)
    • paracrawl-release1.en-fr.zipporah0-dedup-clean.en (3 GB)

Name
paracrawl-release1.en-es.zipporah0-dedup-clean.tgz
Size
1.26 GB
Format
application/x-gzip
Description
English / Spanish
MD5
93f96a0040b84cb836f8d455bf1b29ca
File Preview
    • paracrawl-release1.en-es.zipporah0-dedup-clean.en (1 GB)
    • paracrawl-release1.en-es.zipporah0-dedup-clean.es (2 GB)

Name
paracrawl-release1.en-it.zipporah0-dedup-clean.tgz
Size
593.03 MB
Format
application/x-gzip
Description
English / Italian
MD5
53dd104c43a798091b55174034b4b492
File Preview
    • paracrawl-release1.en-it.zipporah0-dedup-clean.en (887 MB)
    • paracrawl-release1.en-it.zipporah0-dedup-clean.it (952 MB)

Name
paracrawl-release1.en-pt.zipporah0-dedup-clean.tgz
Size
221.55 MB
Format
application/x-gzip
Description
English / Portuguese
MD5
f50ccea47196f8251cac106bc4d85e5d
File Preview
    • paracrawl-release1.en-pt.zipporah0-dedup-clean.pt (344 MB)
    • paracrawl-release1.en-pt.zipporah0-dedup-clean.en (323 MB)

Name
paracrawl-release1.en-nl.zipporah0-dedup-clean.tgz
Size
167.95 MB
Format
application/x-gzip
Description
English / Dutch
MD5
fe68e3b965637cff79ac33d1f4fef9ae
File Preview
    • paracrawl-release1.en-nl.zipporah0-dedup-clean.nl (267 MB)
    • paracrawl-release1.en-nl.zipporah0-dedup-clean.en (250 MB)

Name
paracrawl-release1.en-pl.zipporah0-dedup-clean.tgz
Size
85.68 MB
Format
application/x-gzip
Description
English / Polish
MD5
06fd628e9dc72a6828e5fa84415c33ef
File Preview
    • paracrawl-release1.en-pl.zipporah0-dedup-clean.en (123 MB)
    • paracrawl-release1.en-pl.zipporah0-dedup-clean.pl (131 MB)

Name
paracrawl-release1.en-cs.zipporah0-dedup-clean.tgz
Size
285.2 MB
Format
application/x-gzip
Description
English / Czech
MD5
c55944b07bfe66239549d6a5e47df3fc
File Preview
    • paracrawl-release1.en-cs.zipporah0-dedup-clean.en (517 MB)
    • paracrawl-release1.en-cs.zipporah0-dedup-clean.cs (529 MB)

Name
paracrawl-release1.en-ro.zipporah0-dedup-clean.tgz
Size
105.05 MB
Format
application/x-gzip
Description
English / Romanian
MD5
80d289499db5187dcf85b8b7d32b8f6f
File Preview
    • paracrawl-release1.en-ro.zipporah0-dedup-clean.ro (189 MB)
    • paracrawl-release1.en-ro.zipporah0-dedup-clean.en (181 MB)

Name
paracrawl-release1.en-fi.zipporah0-dedup-clean.tgz
Size
34.46 MB
Format
application/x-gzip
Description
English / Finnish
MD5
1e8b9e321104802d173a47034fc40d85
File Preview
    • paracrawl-release1.en-fi.zipporah0-dedup-clean.fi (53 MB)
    • paracrawl-release1.en-fi.zipporah0-dedup-clean.en (51 MB)

Name
paracrawl-release1.en-lv.zipporah0-dedup-clean.tgz
Size
16.05 MB
Format
application/x-gzip
Description
English / Latvian
MD5
b919a332bee5ea19d4f2c40849674198
File Preview
    • paracrawl-release1.en-lv.zipporah0-dedup-clean.lv (25 MB)
    • paracrawl-release1.en-lv.zipporah0-dedup-clean.en (24 MB)

Name
paracrawl-release1.en-ru.zipporah0-dedup-clean.tgz
Size
637.04 MB
Format
application/x-gzip
Description
English / Russian
MD5
cffdaa673e730138da4f828de040f111
File Preview
    • paracrawl-release1.en-ru.zipporah0-dedup-clean.ru (1 GB)
    • paracrawl-release1.en-ru.zipporah0-dedup-clean.en (961 MB)

Name
paracrawl-release1.en-et.zipporah0-dedup-clean.tgz
Size
74.66 MB
Format
application/x-gzip
Description
English / Estonian
MD5
1267baa8e67df8985f5bbeaf7c7b3df6
File Preview
    • paracrawl-release1.en-et.zipporah0-dedup-clean.et (184 MB)
    • paracrawl-release1.en-et.zipporah0-dedup-clean.en (186 MB)
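
Each archive in the listing is published with an MD5 checksum, so a download can be checked before it is unpacked. The following is a minimal sketch of such a check, not part of the record itself. It assumes the archive has been saved locally under its listed name. The helper names `md5_of_file` and `verify_download` are hypothetical, and only two of the thirteen checksums are copied into the table for brevity.

```python
import hashlib
from pathlib import Path

# MD5 values copied from the file listing in this record.
# Only two entries are shown; the others can be added the same way.
EXPECTED_MD5 = {
    "paracrawl-release1.en-de.zipporah0-dedup-clean.tgz": "30e67e94d111ea675c0567e1c1aa338c",
    "paracrawl-release1.en-et.zipporah0-dedup-clean.tgz": "1267baa8e67df8985f5bbeaf7c7b3df6",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that multi-gigabyte archives are never held in memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the value recorded for its name.

    Raises KeyError if no checksum is recorded for that file name.
    """
    path = Path(path)
    expected = EXPECTED_MD5.get(path.name)
    if expected is None:
        raise KeyError(f"no MD5 recorded for {path.name}")
    return md5_of_file(path) == expected
```

Calling `verify_download("paracrawl-release1.en-de.zipporah0-dedup-clean.tgz")` on a complete download should return `True`. A `False` result suggests a truncated or corrupted transfer, and the file is best downloaded again.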
