BulTreeBank Tokenizer
Please use the following text to cite this item:
Simov, Kiril, 2014,
BulTreeBank Tokenizer, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11372/LRT-1240.
Authors
Item identifier
Project URL
Date issued
2014-07-30
Type
Description
The tokenizer covers all languages that use the Latin1, Latin2, Latin3 and Cyrillic tables of Unicode. It can be extended to cover other Unicode tables if necessary. It is implemented as a cascaded regular grammar in CLaRK. It recognizes over 60 token categories and is easy to adapt to new token categories.
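To illustrate what a cascaded regular grammar means in practice, here is a minimal Python sketch. It is not the CLaRK grammar and does not reproduce the tokenizer's actual categories: the category names (LATIN, CYRILLIC, NUMBER, SPACE, PUNCT, DECIMAL), the character ranges, and the function names are all assumptions made for the example. The first stage assigns primitive categories with regular expressions; the second stage runs over the output of the first and merges token sequences into a higher-level category.

```python
import re

# Stage 1: primitive categories. Names and ranges are hypothetical,
# chosen for illustration; they are not the categories used in CLaRK.
PRIMITIVES = [
    ("LATIN", r"[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F]+"),
    ("CYRILLIC", r"[\u0400-\u04FF]+"),
    ("NUMBER", r"[0-9]+"),
    ("SPACE", r"\s+"),
    ("PUNCT", r"."),  # fallback: any single remaining character
]
PRIMITIVE_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in PRIMITIVES),
    re.DOTALL,
)


def primitive_tokens(text):
    """Stage 1: split text into (category, value) pairs."""
    return [(m.lastgroup, m.group()) for m in PRIMITIVE_RE.finditer(text)]


def merge_decimals(tokens):
    """Stage 2: rewrite NUMBER '.'/',' NUMBER as one DECIMAL token."""
    out = []
    i = 0
    while i < len(tokens):
        window = tokens[i:i + 3]
        if (
            len(window) == 3
            and window[0][0] == "NUMBER"
            and window[1] in (("PUNCT", "."), ("PUNCT", ","))
            and window[2][0] == "NUMBER"
        ):
            out.append(("DECIMAL", "".join(value for _, value in window)))
            i += 3
        else:
            out.append(tokens[i])
            i += 1
    return out


def tokenize(text):
    """Run the stages in order, then drop whitespace tokens."""
    tokens = merge_decimals(primitive_tokens(text))
    return [t for t in tokens if t[0] != "SPACE"]


print(tokenize("Цена 3.14 EUR."))
```

In a design like this, a new token category is added by appending another stage that rewrites the output of the earlier ones, which leaves the existing stages unchanged.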
Collections

