Chared

Pomikálek, Jan

dc.contributor.author	Pomikálek, Jan
dc.date.accessioned	2013-02-01T16:32:21Z
dc.date.available	2013-02-01T16:32:21Z
dc.date.issued	2011
dc.identifier.uri	http://hdl.handle.net/11858/00-097C-0000-000D-F67A-9
dc.description	Chared is a software tool that detects the character encoding of a text document, provided the language of the document is known. The language of the text has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages (currently 57, covering all major languages). Furthermore, it provides a training script for learning models for additional languages from a set of user-supplied sample HTML pages in the given language. The detection algorithm is based on determining the similarity of byte trigram vectors. In general, chared should be more accurate than character encoding detection tools with no language constraints. This is an important advantage, allowing the precise character decoding needed for building large textual corpora. The tool has been used for building corpora in American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages, consisting of 70 billion tokens altogether. Chared is open source software, licensed under the New BSD License and available for download (including the source code) at http://code.google.com/p/chared/. The research leading to this piece of software was published in: POMIKÁLEK, Jan and Vít SUCHOMEL. chared: Character Encoding Detection with a Known Language. In Aleš Horák, Pavel Rychlý. RASLAN 2011. 5th ed. Brno, Czech Republic: Tribun EU, 2011. pp. 125-129 (5 pages). ISBN 978-80-263-0077-9.
dc.description.sponsorship	PRESEMT, Lexical Computing Ltd
dc.language.iso	eng
dc.publisher	Masaryk University, NLP Centre
dc.rights	BSD 3-Clause "New" or "Revised" license
dc.rights.uri	http://opensource.org/licenses/BSD-3-Clause
dc.source.uri	http://code.google.com/p/chared/
dc.subject	character encoding
dc.subject	character encoding detection
dc.subject	charset
dc.subject	unicode
dc.title	Chared
dc.type	toolService
metashare.ResourceInfo#ContactInfo#PersonInfo.surname	Pomikálek
metashare.ResourceInfo#ContactInfo#PersonInfo.givenName	Jan
metashare.ResourceInfo#ContactInfo#PersonInfo#OrganizationInfo.organizationName	Natural Language Processing Centre, Faculty of Informatics Masaryk University
metashare.ResourceInfo#DistributionInfo.availability	restrictedUse
metashare.ResourceInfo#DistributionInfo#LicenseInfo.restrictionsOfUse	attribution
metashare.ResourceInfo#DistributionInfo#LicenseInfo.restrictionsOfUse	shareAlike
metashare.ResourceInfo#DistributionInfo#LicenseInfo.distributionAccessMedium	downloadable
metashare.ResourceInfo#ValidationInfo.validated	True
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.projectName	PRESEMT
metashare.ResourceInfo#ResourceCreationInfo#FundingInfo#ProjectInfo.fundingType	euFunds
metashare.ResourceInfo#TextInfo#SizeInfo.size	23
metashare.ResourceInfo#TextInfo#SizeInfo.sizeUnit	mb
metashare.ResourceInfo#ContactInfo#PersonInfo#OrganizationInfo#CommunicationInfo.email	jan.pomikalek@gmail.com
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent	false
metashare.ResourceInfo#ContentInfo.detailedType	tool
dc.rights.label	PUB
has.files	yes
branding	LINDAT / CLARIAH-CZ
sponsor	PRESEMT
sponsor	Lexical Computing Ltd.
size.info	23 mb
files.size	24156936
files.count	1
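
The description above says that detection is based on the similarity of byte trigram vectors. The following is a minimal sketch of that general idea, not chared's actual implementation: the function names are hypothetical, cosine similarity is an assumed choice of similarity measure, and the toy "models" are built from a single Czech sentence rather than from the trained .edm files shipped in the package.

```python
from collections import Counter
from math import sqrt


def byte_trigrams(data: bytes) -> Counter:
    """Count every overlapping 3-byte sequence in the input."""
    return Counter(data[i:i + 3] for i in range(len(data) - 2))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse trigram count vectors."""
    dot = sum(count * b[tri] for tri, count in a.items() if tri in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def guess_encoding(data: bytes, models: dict) -> str:
    """Return the encoding whose model vector is most similar to the data."""
    doc = byte_trigrams(data)
    return max(models, key=lambda enc: cosine_similarity(doc, models[enc]))


# Toy per-encoding models for one known language (Czech), built by encoding
# the same sample text in each candidate encoding. This is an assumption made
# for illustration; real models would be trained on far more text.
sample = "Příliš žluťoučký kůň úpěl ďábelské ódy."
models = {
    enc: byte_trigrams(sample.encode(enc))
    for enc in ("utf-8", "iso8859_2", "cp1250")
}

unknown = "žluťoučký kůň".encode("cp1250")
print(guess_encoding(unknown, models))
```

Because the language is fixed in advance, the candidate vectors differ only in how the same text is encoded, which is what lets closely related encodings such as ISO-8859-2 and Windows-1250 be told apart by the few bytes on which they disagree.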

Files in this item

This item is publicly available and licensed under:
BSD 3-Clause "New" or "Revised" license

Name: chared-1.2.1.tar.gz
Size: 23.04 MB
Format: application/x-gzip
Description: chared is a tool for detecting the character encoding of a text in a known language. The language of the text has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages. In general, it should be more accurate than character encoding detection algorithms with no language constraints.
MD5: 12d042d631b0b30c9540ff9e1a5b7a13
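
The MD5 value above can be used to check that a downloaded copy of the archive is intact. A minimal sketch using Python's standard hashlib module follows; the helper names are hypothetical, and the local path is an assumption that the file was saved under its listed name in the current directory.

```python
import hashlib
import os

# Checksum and file name as listed in this record.
EXPECTED_MD5 = "12d042d631b0b30c9540ff9e1a5b7a13"
ARCHIVE = "chared-1.2.1.tar.gz"  # assumed local path of the download


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()


# Only check when the archive is actually present locally.
if os.path.exists(ARCHIVE):
    print("checksum OK" if matches_record(ARCHIVE) else "checksum MISMATCH")
```

MD5 is adequate for detecting accidental corruption during download, but it is not a safeguard against deliberate tampering.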

File Preview

chared-1.2.1
- CHANGES (572 B)
- chared
  - util
    - encoding.py (1 kB)
    - html2txt.py (2 kB)
    - __init__.py (0 B)
    - trigrams.py (6 kB)
  - detector.py (7 kB)
  - __init__.py (378 B)
  - models
    - czech.edm (883 kB)
    - japanese.edm (4 MB)
    - malayalam.edm (133 kB)
    - catalan.edm (499 kB)
    - arabic.edm (864 kB)
    - korean.edm (19 B)
    - serbian.edm (1 MB)
    - slovak.edm (1 MB)
    - thai.edm (1 MB)
    - hindi.edm (203 kB)
    - persian.edm (794 kB)
    - romanian.edm (691 kB)
    - dutch.edm (323 kB)
    - norwegian_bokmal.edm (362 kB)
    - lithuanian.edm (383 kB)
    - chinese_traditional.edm (12 MB)
    - bulgarian.edm (964 kB)
    - icelandic.edm (521 kB)
    - welsh.edm (256 kB)
    - croatian.edm (591 kB)
    - maltese.edm (235 kB)
    - chinese_simplified.edm (2 MB)
    - hungarian.edm (976 kB)
    - irish.edm (399 kB)
    - estonian.edm (778 kB)
    - portuguese.edm (499 kB)
    - greek.edm (1 MB)
    - tamil.edm (144 kB)
    - bengali.edm (191 kB)
    - italian.edm (243 kB)
    - swedish.edm (426 kB)
    - finnish.edm (355 kB)
    - telugu.edm (194 kB)
    - urdu.edm (1 MB)
    - turkish.edm (845 kB)
    - indonesian.edm (154 kB)
    - ukrainian.edm (1 MB)
    - hebrew.edm (749 kB)
    - polish.edm (521 kB)
    - latvian.edm (472 kB)
    - gujarati.edm (232 kB)
    - malay.edm (146 kB)
    - russian.edm (1 MB)
    - vietnamese.edm (531 kB)
    - english.edm (105 kB)
    - armenian.edm (448 kB)
    - slovene.edm (473 kB)
    - french.edm (353 kB)
    - german.edm (424 kB)
    - spanish.edm (363 kB)
- MANIFEST.in (98 B)
- bin
  - chared (3 kB)
  - chared-learn (13 kB)
- setup.py (1 kB)
- README (277 B)
- COPYING (1 kB)
- PKG-INFO (720 B)
