Show simple item record
dc.contributor.author |
Koehn, Philipp |
dc.contributor.author |
Heafield, Kenneth |
dc.contributor.author |
Forcada, Mikel L. |
dc.contributor.author |
Esplà-Gomis, Miquel |
dc.contributor.author |
Ortiz-Rojas, Sergio |
dc.contributor.author |
Sánchez, Gema Ramírez |
dc.contributor.author |
Cartagena, Víctor M. Sánchez |
dc.contributor.author |
Haddow, Barry |
dc.contributor.author |
Bañón, Marta |
dc.contributor.author |
Střelec, Marek |
dc.contributor.author |
Samiotou, Anna |
dc.contributor.author |
Kamran, Amir |
dc.date.accessioned |
2018-02-12T07:41:46Z |
dc.date.available |
2018-02-12T07:41:46Z |
dc.date.issued |
2018-01-14 |
dc.identifier.uri |
http://hdl.handle.net/11372/LRT-2610 |
dc.description |
The January 2018 release of ParaCrawl is the first version of the corpus. It contains parallel corpora for 11 languages paired with English, crawled from a large number of websites. The selection of websites is based on CommonCrawl, but ParaCrawl is extracted from a brand-new crawl that has much higher coverage of these selected websites than CommonCrawl. Since the data is fairly raw, it is released with two quality metrics that can be used for corpus filtering. An official "clean" version of each corpus uses one of the metrics. For more details and to download the raw data, please visit: http://paracrawl.eu/releases.html |
dc.language.iso |
eng |
dc.language.iso |
deu |
dc.language.iso |
fra |
dc.language.iso |
spa |
dc.language.iso |
ita |
dc.language.iso |
por |
dc.language.iso |
nld |
dc.language.iso |
pol |
dc.language.iso |
ces |
dc.language.iso |
ron |
dc.language.iso |
fin |
dc.language.iso |
lav |
dc.language.iso |
rus |
dc.language.iso |
est |
dc.publisher |
ParaCrawl |
dc.rights |
Public Domain Dedication (CC Zero) |
dc.rights.uri |
http://creativecommons.org/publicdomain/zero/1.0/ |
dc.source.uri |
http://paracrawl.eu |
dc.subject |
ParaCrawl |
dc.subject |
parallel corpus |
dc.subject |
CommonCrawl |
dc.subject |
machine translation |
dc.subject |
text corpora |
dc.title |
ParaCrawl Corpus version 1.0 |
dc.type |
corpus |
metashare.ResourceInfo#ContentInfo.mediaType |
text |
dc.rights.label |
PUB |
has.files |
yes |
branding |
LRT + Open Submissions |
contact.person |
Amir Kamran amir@taus.net TAUS |
sponsor |
European Union CEF-TC-2016-3 / Action No: 2016-EU-IA-0114 Connecting Europe Facility (CEF) euFunds |
files.size |
7857038670 |
files.count |
13 |