ParaCrawl Corpus version 1.0
Please use the following text to cite this item:
Koehn, Philipp; et al., 2018,
ParaCrawl Corpus version 1.0, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11372/LRT-2610.
Authors
Koehn, Philipp; et al.
Item identifier
Project URL
Date issued
2018-01-14
Description
The January 2018 release of ParaCrawl is the first version of the corpus. It contains parallel corpora for 11 languages paired with English, crawled from a large number of websites. The selection of websites is based on CommonCrawl, but ParaCrawl is extracted from a new crawl that has much higher coverage of the selected websites than CommonCrawl. Because the data is fairly raw, it is released with two quality metrics that can be used for corpus filtering. An official "clean" version of each corpus is filtered using one of these metrics. For more details and the raw data download, please visit: http://paracrawl.eu/releases.html
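As an illustration of how a quality metric can drive corpus filtering, here is a minimal Python sketch. The record does not document the raw file layout, so the tab-separated format (source sentence, target sentence, score), the column index, and the function name `filter_by_score` are all assumptions made for this example. The default threshold of 0 is only a guess suggested by the "zipporah0" part of the clean file names. Check the release notes at the URL above for the real format before relying on it.

```python
def filter_by_score(lines, threshold=0.0, score_column=2):
    """Yield (source, target) pairs whose quality score is >= threshold.

    ASSUMPTION: each line is "source<TAB>target<TAB>score". This layout is
    hypothetical; it is not taken from the ParaCrawl documentation.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= score_column:
            continue  # skip lines that have no score column
        try:
            score = float(fields[score_column])
        except ValueError:
            continue  # skip lines whose score is not a number
        if score >= threshold:
            yield fields[0], fields[1]


# Made-up sample lines, used only to demonstrate the filter.
sample = [
    "Hello world\tSveika pasaule\t1.7\n",
    "Click here\tLorem ipsum\t-3.2\n",
    "broken line without score\n",
]
kept = list(filter_by_score(sample, threshold=0.0))
print(kept)
```

Raising the threshold trades corpus size for cleaner sentence pairs; a suitable value depends on the metric and on the downstream task.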
Publisher
Acknowledgement
European Union
Project code: CEF-TC-2016-3 / Action No: 2016-EU-IA-0114
Project name: Connecting Europe Facility (CEF)
Collections
Files in this item
- Name
- paracrawl-release1.en-lv.zipporah0-dedup-clean.tgz
- Size
- 16.05 MB
- Format
- application/x-gzip
- Description
- gzip Archive
- MD5
- b919a332bee5ea19d4f2c40849674198

- Name
- paracrawl-release1.en-fi.zipporah0-dedup-clean.tgz
- Size
- 34.46 MB
- Format
- application/x-gzip
- Description
- gzip Archive
- MD5
- 1e8b9e321104802d173a47034fc40d85

- Name
- paracrawl-release1.en-et.zipporah0-dedup-clean.tgz
- Size
- 74.66 MB
- Format
- application/x-gzip
- Description
- gzip Archive
- MD5
- 1267baa8e67df8985f5bbeaf7c7b3df6

- Name
- paracrawl-release1.en-pl.zipporah0-dedup-clean.tgz
- Size
- 85.68 MB
- Format
- application/x-gzip
- Description
- gzip Archive
- MD5
- 06fd628e9dc72a6828e5fa84415c33ef

- Name
- paracrawl-release1.en-nl.zipporah0-dedup-clean.tgz
- Size
- 167.95 MB
- Format
- application/x-gzip
- Description
- gzip Archive
- MD5
- fe68e3b965637cff79ac33d1f4fef9ae

- Name
- paracrawl-release1.en-ro.zipporah0-dedup-clean.tgz
- Size
- 105.05 MB
- Format
- application/x-gzip
- Description
- gzip Archive
- MD5
- 80d289499db5187dcf85b8b7d32b8f6f

- Name
- paracrawl-release1.en-ru.zipporah0-dedup-clean.tgz
- Size
- 637.04 MB
- Format
- application/x-gzip
- Description
- gzip Archive
- MD5
- cffdaa673e730138da4f828de040f111

- Name
- paracrawl-release1.en-pt.zipporah0-dedup-clean.tgz
- Size
- 221.55 MB
- Format
- application/x-gzip
- Description
- gzip Archive
- MD5
- f50ccea47196f8251cac106bc4d85e5d

- Name
- paracrawl-release1.en-it.zipporah0-dedup-clean.tgz
- Size
- 593.03 MB
- Format
- application/x-gzip
- Description
- gzip Archive
- MD5
- 53dd104c43a798091b55174034b4b492

- Name
- paracrawl-release1.en-es.zipporah0-dedup-clean.tgz
- Size
- 1.26 GB
- Format
- application/x-gzip
- Description
- gzip Archive
- MD5
- 93f96a0040b84cb836f8d455bf1b29ca

- Name
- paracrawl-release1.en-de.zipporah0-dedup-clean.tgz
- Size
- 1.79 GB
- Format
- application/x-gzip
- Description
- gzip Archive
- MD5
- 30e67e94d111ea675c0567e1c1aa338c

- Name
- paracrawl-release1.en-cs.zipporah0-dedup-clean.tgz
- Size
- 285.2 MB
- Format
- application/x-gzip
- Description
- gzip Archive
- MD5
- c55944b07bfe66239549d6a5e47df3fc

- Name
- paracrawl-release1.en-fr.zipporah0-dedup-clean.tgz
- Size
- 2.1 GB
- Format
- application/x-gzip
- Description
- gzip Archive
- MD5
- 89dc66bac5125a3d7aeece01809402b7
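
The MD5 values listed above can be used to check that a download completed without corruption. The following is a minimal Python sketch using the standard `hashlib` module; the helper names `md5_of_file` and `verify_download` are hypothetical, and the local file path assumes the archive was saved under its listed name in the current directory.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's MD5 digest matches the published value."""
    return md5_of_file(path) == expected_md5.strip().lower()


if __name__ == "__main__":
    # File name and checksum copied from the en-lv entry in the listing.
    archive = Path("paracrawl-release1.en-lv.zipporah0-dedup-clean.tgz")
    expected = "b919a332bee5ea19d4f2c40849674198"
    if archive.exists():
        print("OK" if verify_download(archive, expected) else "MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

Reading in chunks keeps memory use constant, which matters for the larger archives in this item. MD5 is adequate for detecting transfer errors but should not be treated as a guarantee against deliberate tampering.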


