BulTreeBank Tokenizer
- Authors
- Simov, Kiril
- Identifier
- http://hdl.handle.net/11372/LRT-1240
- Project URL
- http://www.bultreebank.org/clark/index.html
- Date issued
- 2014-07-30
- Type
- toolService
- Description
- The tokenizer covers all languages that use the Latin1, Latin2, Latin3 and Cyrillic tables of Unicode, and it can be extended to cover other Unicode tables if necessary. It is implemented as a cascaded regular grammar in CLaRK. It recognizes over 60 token categories and can easily be adapted to new ones.
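
To illustrate what a cascaded regular grammar means in practice, here is a minimal Python sketch. It is not the CLaRK grammar and does not use CLaRK's rule syntax. The category names (CYR, LAT, NUM, PUNCT, SPACE, DECIMAL), the character ranges, and the single second-stage rule are all assumptions chosen for illustration. The first stage matches regular expressions over characters; the second stage matches a regular expression over the categories produced by the first.

```python
import re

# Stage 1: primitive categories matched on raw characters.
# Category names and ranges are hypothetical, not taken from CLaRK.
PRIMITIVE = re.compile(
    r"(?P<CYR>[\u0400-\u04FF]+)"                                   # Cyrillic block
    r"|(?P<LAT>[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F]+)"  # Latin letters
    r"|(?P<NUM>[0-9]+)"
    r"|(?P<SPACE>\s+)"
    r"|(?P<PUNCT>.)",                                              # any other single character
    re.DOTALL,
)


def stage1(text):
    """Split text into (category, string) pairs."""
    return [(m.lastgroup, m.group()) for m in PRIMITIVE.finditer(text)]


def stage2(tokens):
    """Merge NUM, '.' or ',', NUM into one DECIMAL token.

    The rule is a regular expression over a one-letter encoding of the
    stage-1 categories, which is what makes the grammar cascaded.
    """
    codes = "".join(
        "N" if cat == "NUM"
        else "D" if cat == "PUNCT" and text in ".,"
        else "X"
        for cat, text in tokens
    )
    out, pos = [], 0
    for m in re.finditer("NDN", codes):
        out.extend(tokens[pos:m.start()])
        merged = "".join(text for _, text in tokens[m.start():m.end()])
        out.append(("DECIMAL", merged))
        pos = m.end()
    out.extend(tokens[pos:])
    return out


def tokenize(text):
    """Run both stages and drop whitespace tokens."""
    return [t for t in stage2(stage1(text)) if t[0] != "SPACE"]


print(tokenize("Цена 3,14 EUR."))
```

Under these assumptions, adding a new token category amounts to adding one more rule at the appropriate stage, without touching the earlier stages.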