Wikicorpus

Please use the following text to cite this item:
Centro de Tecnologías y Aplicaciones del Lenguaje y del Habla (TALP), 2014, Wikicorpus, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11372/LRT-1105.
Date issued
2014-07-30
Type
Description
Trilingual corpus (Catalan, Spanish, English) that contains large portions of Wikipedia (based on a 2006 dump) and has been automatically enriched with linguistic information. The present version contains over 750 million words.