ParaCrawl Corpus version 1.0

Please use the following text to cite this item:
Koehn, Philipp; et al., 2018, ParaCrawl Corpus version 1.0, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11372/LRT-2610.
Date issued
2018-01-14
Description
The January 2018 release of ParaCrawl is the first version of the corpus. It contains parallel corpora for 11 languages paired with English, crawled from a large number of websites. The selection of websites is based on CommonCrawl, but ParaCrawl is extracted from a new crawl that has much higher coverage of the selected websites than CommonCrawl does. Since the data is fairly raw, it is released with two quality metrics that can be used for corpus filtering. An official "clean" version of each corpus is filtered using one of these metrics. For more details and to download the raw data, please visit: http://paracrawl.eu/releases.html
This item is Publicly Available
and licensed under:
 Files in this item
| Name | Size | Format | Description | MD5 |
|---|---|---|---|---|
| paracrawl-release1.en-lv.zipporah0-dedup-clean.tgz | 16.05 MB | application/x-gzip | gzip Archive | b919a332bee5ea19d4f2c40849674198 |
| paracrawl-release1.en-fi.zipporah0-dedup-clean.tgz | 34.46 MB | application/x-gzip | gzip Archive | 1e8b9e321104802d173a47034fc40d85 |
| paracrawl-release1.en-et.zipporah0-dedup-clean.tgz | 74.66 MB | application/x-gzip | gzip Archive | 1267baa8e67df8985f5bbeaf7c7b3df6 |
| paracrawl-release1.en-pl.zipporah0-dedup-clean.tgz | 85.68 MB | application/x-gzip | gzip Archive | 06fd628e9dc72a6828e5fa84415c33ef |
| paracrawl-release1.en-nl.zipporah0-dedup-clean.tgz | 167.95 MB | application/x-gzip | gzip Archive | fe68e3b965637cff79ac33d1f4fef9ae |
| paracrawl-release1.en-ro.zipporah0-dedup-clean.tgz | 105.05 MB | application/x-gzip | gzip Archive | 80d289499db5187dcf85b8b7d32b8f6f |
| paracrawl-release1.en-ru.zipporah0-dedup-clean.tgz | 637.04 MB | application/x-gzip | gzip Archive | cffdaa673e730138da4f828de040f111 |
| paracrawl-release1.en-pt.zipporah0-dedup-clean.tgz | 221.55 MB | application/x-gzip | gzip Archive | f50ccea47196f8251cac106bc4d85e5d |
| paracrawl-release1.en-it.zipporah0-dedup-clean.tgz | 593.03 MB | application/x-gzip | gzip Archive | 53dd104c43a798091b55174034b4b492 |
| paracrawl-release1.en-es.zipporah0-dedup-clean.tgz | 1.26 GB | application/x-gzip | gzip Archive | 93f96a0040b84cb836f8d455bf1b29ca |
| paracrawl-release1.en-de.zipporah0-dedup-clean.tgz | 1.79 GB | application/x-gzip | gzip Archive | 30e67e94d111ea675c0567e1c1aa338c |
| paracrawl-release1.en-cs.zipporah0-dedup-clean.tgz | 285.2 MB | application/x-gzip | gzip Archive | c55944b07bfe66239549d6a5e47df3fc |
| paracrawl-release1.en-fr.zipporah0-dedup-clean.tgz | 2.1 GB | application/x-gzip | gzip Archive | 89dc66bac5125a3d7aeece01809402b7 |
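
The MD5 values listed above can be used to check that an archive was downloaded intact. The following is a minimal sketch of one way to do that in Python. It is not part of the ParaCrawl release: the function names `md5_of_file` and `verify_download` are hypothetical, and the sketch assumes the archive has already been saved locally under its original file name. Only two of the listed checksums are included; the rest can be added in the same way.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing on this page (two entries shown;
# the remaining files can be added in the same way).
EXPECTED_MD5 = {
    "paracrawl-release1.en-lv.zipporah0-dedup-clean.tgz": "b919a332bee5ea19d4f2c40849674198",
    "paracrawl-release1.en-fi.zipporah0-dedup-clean.tgz": "1e8b9e321104802d173a47034fc40d85",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 matches the value recorded for its name.

    Raises KeyError when no checksum is recorded for the file name.
    (Hypothetical helper, named here for illustration only.)
    """
    path = Path(path)
    if path.name not in expected:
        raise KeyError(f"no checksum recorded for {path.name}")
    return md5_of_file(path) == expected[path.name]


# Example usage, assuming the archive sits in the current directory:
# ok = verify_download("paracrawl-release1.en-lv.zipporah0-dedup-clean.tgz")
# print("checksum matches" if ok else "checksum mismatch, download again")
```

MD5 is adequate here for detecting a truncated or corrupted transfer, which is the purpose the listing serves; it should not be relied on as protection against deliberate tampering.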