
W2C – Web to Corpus – tool

Please use the following text to cite this item:
Majliš, Martin, 2011, W2C – Web to Corpus – tool, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), http://hdl.handle.net/11858/00-097C-0000-0022-60D6-1.
Date issued
2011-12-20
Description
A tool used to build multilingual corpora from Wikipedia. It downloads web pages, converts them to plain text, identifies their language, etc. A set of 120 corpora collected using this tool is available at https://ufal-point.mff.cuni.cz/xmlui/handle/11858/00-097C-0000-0022-6133-9
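As an illustration of two of the steps named in the description (converting a page to plain text and identifying its language), here is a minimal Python sketch. It is not the W2C implementation: the names `TextExtractor`, `html_to_text`, `guess_language` and the tiny `STOPWORDS` lists are all hypothetical, and the stopword-counting heuristic is only a stand-in for whatever language identification the tool actually uses.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical, deliberately tiny stopword lists (assumption, not from W2C).
STOPWORDS = {
    "en": {"the", "and", "of", "is", "to"},
    "cs": {"a", "je", "se", "na", "že"},
}


def guess_language(text):
    """Pick the language whose stopwords occur most often in the text."""
    words = text.lower().split()
    scores = {
        lang: sum(word in stopwords for word in words)
        for lang, stopwords in STOPWORDS.items()
    }
    return max(scores, key=scores.get)
```

A real pipeline would also need the download step and a far more robust language identifier; the sketch only shows how the stages could be chained on a page that has already been fetched.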
This item is Publicly Available
and licensed under:
 Files in this item
Name
tr46.pdf
Size
567.11 KB
Format
application/pdf
Description
Adobe PDF
MD5
824ef862d75b40fc324d54b13a592ee1
Name
w2c.tar.gz
Size
165.85 KB
Format
application/x-gzip
Description
gzip Archive
MD5
747d9fabca38d085e976950193029ca3
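The MD5 values listed above can be used to check that a download is intact. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify`, and the assumption that the files were saved locally under their listed names, are illustrative only.

```python
import hashlib

# Checksums copied from the file listing on this page.
EXPECTED_MD5 = {
    "tr46.pdf": "824ef862d75b40fc324d54b13a592ee1",
    "w2c.tar.gz": "747d9fabca38d085e976950193029ca3",
}


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, name):
    """Compare a local file's MD5 with the value listed for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, `verify("w2c.tar.gz", "w2c.tar.gz")` would return `True` only if the local archive matches the listed checksum.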