Chared

Pomikálek, Jan

LINDAT / CLARIAH-CZ

Authors: Pomikálek, Jan

Item identifier: http://hdl.handle.net/11858/00-097C-0000-000D-F67A-9

Project URL: http://code.google.com/p/chared/

Date issued: 2011

Type: toolService

Size: 23 MB

Language(s): English

Description: Chared is a software tool that detects the character encoding of a text document, provided the language of the document is known. The language has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages (currently 57, covering all major languages). It also provides a training script for learning models for additional languages from a set of user-supplied sample HTML pages in the given language. The detection algorithm is based on determining the similarity of byte trigram vectors. Because the language is known, chared should in general be more accurate than character encoding detection tools that work with no language constraints. This is an important advantage, since precise character decoding is needed for building large text corpora. The tool has been used to build corpora in American Spanish, Arabic, Czech, French, Japanese, Russian, Tajik, and six Turkic languages, consisting of 70 billion tokens altogether. Chared is open source software, licensed under the New BSD License and available for download (including the source code) at http://code.google.com/p/chared/. The research leading to this software was published in: POMIKÁLEK, Jan and Vít SUCHOMEL. chared: Character Encoding Detection with a Known Language. In Aleš Horák, Pavel Rychlý. RASLAN 2011. 5th ed. Brno, Czech Republic: Tribun EU, 2011. pp. 125-129 (5 pages). ISBN 978-80-263-0077-9.
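
The idea of comparing byte trigram vectors can be illustrated with a minimal Python sketch. It is not chared's actual code: the function names, the use of cosine similarity as the similarity measure, and training from a single sample string are all assumptions made for illustration. The sketch encodes a sample text in each candidate encoding, counts overlapping 3-byte sequences, and picks the encoding whose vector is most similar to that of the input bytes.

```python
from collections import Counter
from math import sqrt


def byte_trigrams(data):
    """Count every overlapping 3-byte sequence in a bytes object."""
    return Counter(data[i:i + 3] for i in range(len(data) - 2))


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (assumed measure)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(sample_text, encodings):
    """Hypothetical training step: one trigram vector per candidate encoding."""
    return {enc: byte_trigrams(sample_text.encode(enc)) for enc in encodings}


def detect(data, model):
    """Return the candidate encoding whose vector is most similar to the input."""
    vector = byte_trigrams(data)
    return max(model, key=lambda enc: cosine(vector, model[enc]))


# Usage: a Czech sentence and three encodings commonly used for Czech.
sample = "Příliš žluťoučký kůň úpěl ďábelské ódy."
model = train(sample, ["utf-8", "iso-8859-2", "cp1250"])
guess = detect(sample.encode("cp1250"), model)
```

Restricting the candidates to encodings that are plausible for one language is what makes closely related single-byte encodings, such as ISO-8859-2 and Windows-1250, distinguishable: they differ only in a few byte values, and those differences show up in the trigram counts.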

Publisher: Masaryk University, NLP Centre

Subject(s): character encoding, character encoding detection, charset, unicode

Collection(s): LINDAT / CLARIAH-CZ Data & Tools

Files in this item

This item is publicly available and licensed under the BSD 3-Clause "New" or "Revised" license.

Name: chared-1.2.1.tar.gz
Size: 23.04 MB
Format: application/x-gzip
Description: chared is a tool for detecting the character encoding of a text in a known language. The language of the text has to be specified as an input parameter so that the corresponding language model can be used. The package contains models for a wide range of languages. In general, it should be more accurate than character encoding detection algorithms with no language constraints.
MD5: 12d042d631b0b30c9540ff9e1a5b7a13
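
The MD5 value above can be used to check a downloaded copy of the archive. The following is a minimal sketch using Python's standard hashlib module; the helper name md5_of_file and the assumption that the archive sits in the current directory under its listed name are illustrative only.

```python
import hashlib

# Checksum as listed in this record.
EXPECTED_MD5 = "12d042d631b0b30c9540ff9e1a5b7a13"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example call (assumes the archive was saved under its listed name):
# md5_of_file("chared-1.2.1.tar.gz") == EXPECTED_MD5
```

A mismatch usually indicates an incomplete or corrupted download. MD5 is adequate for that purpose but is not a safeguard against deliberate tampering.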

File Preview

chared-1.2.1
- CHANGES (572 B)
- chared
  - util
    - encoding.py (1 kB)
    - html2txt.py (2 kB)
    - __init__.py (0 B)
    - trigrams.py (6 kB)
  - detector.py (7 kB)
  - __init__.py (378 B)
  - models
    - czech.edm (883 kB)
    - japanese.edm (4 MB)
    - malayalam.edm (133 kB)
    - catalan.edm (499 kB)
    - arabic.edm (864 kB)
    - korean.edm (19 B)
    - serbian.edm (1 MB)
    - slovak.edm (1 MB)
    - thai.edm (1 MB)
    - hindi.edm (203 kB)
    - persian.edm (794 kB)
    - romanian.edm (691 kB)
    - dutch.edm (323 kB)
    - norwegian_bokmal.edm (362 kB)
    - lithuanian.edm (383 kB)
    - chinese_traditional.edm (12 MB)
    - bulgarian.edm (964 kB)
    - icelandic.edm (521 kB)
    - welsh.edm (256 kB)
    - croatian.edm (591 kB)
    - maltese.edm (235 kB)
    - chinese_simplified.edm (2 MB)
    - hungarian.edm (976 kB)
    - irish.edm (399 kB)
    - estonian.edm (778 kB)
    - portuguese.edm (499 kB)
    - greek.edm (1 MB)
    - tamil.edm (144 kB)
    - bengali.edm (191 kB)
    - italian.edm (243 kB)
    - swedish.edm (426 kB)
    - finnish.edm (355 kB)
    - telugu.edm (194 kB)
    - urdu.edm (1 MB)
    - turkish.edm (845 kB)
    - indonesian.edm (154 kB)
    - ukrainian.edm (1 MB)
    - hebrew.edm (749 kB)
    - polish.edm (521 kB)
    - latvian.edm (472 kB)
    - gujarati.edm (232 kB)
    - malay.edm (146 kB)
    - russian.edm (1 MB)
    - vietnamese.edm (531 kB)
    - english.edm (105 kB)
    - armenian.edm (448 kB)
    - slovene.edm (473 kB)
    - french.edm (353 kB)
    - german.edm (424 kB)
    - spanish.edm (363 kB)
- MANIFEST.in (98 B)
- bin
  - chared (3 kB)
  - chared-learn (13 kB)
- setup.py (1 kB)
- README (277 B)
- COPYING (1 kB)
- PKG-INFO (720 B)