Pomikálek, Jan 2013-02-01T16:32:21Z 2013-02-01T16:32:21Z 2011
dc.description Chared is a software tool which can detect the character encoding of a text document provided the language of the document is known. The language of the text has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages (currently 57, covering all major languages). Furthermore, it provides a training script to learn models for additional languages using a set of user-supplied sample HTML pages in the given language. The detection algorithm is based on determining the similarity of byte trigram vectors. In general, chared should be more accurate than other character encoding detection tools with no language constraints. This is an important advantage, allowing the precise character decoding needed for building large textual corpora. The tool has been used for building corpora in American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages consisting of 70 billion tokens altogether. Chared is open source software, licensed under the New BSD License and available for download (including the source code). The research leading to this piece of software was published in POMIKÁLEK, Jan and Vít SUCHOMEL. chared: Character Encoding Detection with a Known Language. In Aleš Horák, Pavel Rychlý (eds.). RASLAN 2011. 5th ed. Brno, Czech Republic: Tribun EU, 2011, pp. 125-129 (5 pages). ISBN 978-80-263-0077-9.
dc.description.sponsorship PRESEMT, Lexical Computing Ltd
dc.language.iso eng
dc.publisher Masaryk University, NLP Centre
dc.rights BSD 3-Clause "New" or "Revised" license
dc.subject character encoding
dc.subject character encoding detection
dc.subject charset
dc.subject unicode
dc.title Chared
dc.type toolService
metashare.ResourceInfo#ContactInfo#PersonInfo.surname Pomikálek
metashare.ResourceInfo#ContactInfo#PersonInfo.givenName Jan
metashare.ResourceInfo#ContactInfo#PersonInfo#OrganizationInfo.organizationName Natural Language Processing Centre, Faculty of Informatics Masaryk University
metashare.ResourceInfo#DistributionInfo.availability restrictedUse
metashare.ResourceInfo#DistributionInfo#LicenseInfo.restrictionsOfUse attribution
metashare.ResourceInfo#DistributionInfo#LicenseInfo.restrictionsOfUse shareAlike
metashare.ResourceInfo#DistributionInfo#LicenseInfo.distributionAccessMedium downloadable
metashare.ResourceInfo#ValidationInfo.validated True
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.projectName PRESEMT
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.fundingType euFunds
metashare.ResourceInfo#TextInfo#SizeInfo.size 23
metashare.ResourceInfo#TextInfo#SizeInfo.sizeUnit mb
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent false
metashare.ResourceInfo#ContentInfo.detailedType tool
dc.rights.label PUB
has.files yes
sponsor PRESEMT
sponsor Lexical Computing Ltd.
files.size 24156936
files.count 1

 Files in this item

This item is Publicly Available and licensed under: BSD 3-Clause "New" or "Revised" license
BSD Attribution Required
23.04 MB
chared is a tool for detecting the character encoding of a text in a known language. The language of the text has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages. In general, it should be more accurate than character encoding detection algorithms with no language constraints.
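The idea of comparing byte trigram vectors against a per-language model can be illustrated with a minimal sketch. This is not chared's actual code or API: the function names (`byte_trigrams`, `cosine`, `train`, `detect`) are hypothetical, the model is built from a single sample string rather than a set of HTML pages, and cosine similarity is an assumed choice of similarity measure.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def byte_trigrams(data: bytes) -> Counter:
    """Count every overlapping 3-byte sequence in the input."""
    return Counter(data[i:i + 3] for i in range(len(data) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse trigram count vectors."""
    dot = sum(count * b[tri] for tri, count in a.items() if tri in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(sample_text: str, encodings: Iterable[str]) -> Dict[str, Counter]:
    """Hypothetical model: one trigram vector per candidate encoding,
    built by encoding sample text in the known language."""
    return {enc: byte_trigrams(sample_text.encode(enc)) for enc in encodings}


def detect(data: bytes, model: Dict[str, Counter]) -> str:
    """Return the candidate encoding whose vector is most similar to the input."""
    doc = byte_trigrams(data)
    return max(model, key=lambda enc: cosine(doc, model[enc]))


# Usage: a tiny Czech model with three candidate encodings (assumed for illustration).
sample = ("Příliš žluťoučký kůň úpěl ďábelské ódy. "
          "Česká republika leží ve střední Evropě.")
model = train(sample, ["utf-8", "iso-8859-2", "cp1250"])
print(detect("Žluťoučký kůň příliš úpěl.".encode("cp1250"), model))
```

Knowing the language is what makes this workable: the model only has to separate the handful of encodings plausible for that language, and characters such as š, ž and ť, whose byte values differ between ISO-8859-2 and Windows-1250, become strong discriminating trigrams.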
 File Preview  
  • chared-1.2.1
    • CHANGES (572 B)
    • chared
      • util
        • encoding.py (1 kB)
        • html2txt.py (2 kB)
        • __init__.py (0 B)
        • trigrams.py (6 kB)
      • detector.py (7 kB)
      • __init__.py (378 B)
      • models
        • czech.edm (883 kB)
        • japanese.edm (4 MB)
        • malayalam.edm (133 kB)
        • catalan.edm (499 kB)
        • arabic.edm (864 kB)
        • korean.edm (19 B)
        • serbian.edm (1 MB)
        • slovak.edm (1 MB)
        • thai.edm (1 MB)
        • hindi.edm (203 kB)
        • persian.edm (794 kB)
        • romanian.edm (691 kB)
        • dutch.edm (323 kB)
        • norwegian_bokmal.edm (362 kB)
        • lithuanian.edm (383 kB)
        • chinese_traditional.edm (12 MB)
        • bulgarian.edm (964 kB)
        • icelandic.edm (521 kB)
        • welsh.edm (256 kB)
        • croatian.edm (591 kB)
        • maltese.edm (235 kB)
        • chinese_simplified.edm (2 MB)
        • hungarian.edm (976 kB)
        • irish.edm (399 kB)
        • estonian.edm (778 kB)
        • portuguese.edm (499 kB)
        • greek.edm (1 MB)
        • tamil.edm (144 kB)
        • bengali.edm (191 kB)
        • italian.edm (243 kB)
        • swedish.edm (426 kB)
        • finnish.edm (355 kB)
        • telugu.edm (194 kB)
        • urdu.edm (1 MB)
        • turkish.edm (845 kB)
        • indonesian.edm (154 kB)
        • ukrainian.edm (1 MB)
        • hebrew.edm (749 kB)
        • polish.edm (521 kB)
        • latvian.edm (472 kB)
        • gujarati.edm (232 kB)
        • malay.edm (146 kB)
        • russian.edm (1 MB)
        • vietnamese.edm (531 kB)
        • english.edm (105 kB)
        • armenian.edm (448 kB)
        • slovene.edm (473 kB)
        • french.edm (353 kB)
        • german.edm (424 kB)
        • spanish.edm (363 kB)
    • MANIFEST.in (98 B)
    • bin
      • chared (3 kB)
      • chared-learn (13 kB)
    • setup.py (1 kB)
    • README (277 B)
    • COPYING (1 kB)
    • PKG-INFO (720 B)
