dc.contributor.author |
Simov, Kiril |
dc.contributor.other |
Simov, Kiril |
dc.date.accessioned |
2014-07-30T21:33:43Z |
dc.date.available |
2014-07-30T21:33:43Z |
dc.date.issued |
2014-07-30 |
dc.identifier.uri |
http://hdl.handle.net/11372/LRT-1240 |
dc.description |
The tokenizer covers all languages that use the Latin1, Latin2, Latin3 and Cyrillic tables of Unicode, and can be extended to cover other Unicode tables if necessary. It is implemented as a cascaded regular grammar in CLaRK. It recognizes over 60 token categories and is easy to adapt to new token categories. |
dc.publisher |
Linguistic Modeling Department, IPP, Bulgarian Academy of Sciences |
dc.source.uri |
http://www.bultreebank.org/clark/index.html |
dc.title |
BulTreeBank Tokenizer |
dc.type |
toolService |
has.files |
no |
additional.metadata |
Language(s) of input data (field_tool_input_language):-- any --
Implementation language(s) (field_tool_implementation_langu):Java
Software requirements (field_tool_software_requirement):Java
Webservice link (field_tool_webservice_link):http://www.bultreebank.org/clark/index.html
Availability (field_tool_availibility):free
Nid:993
System requirements (field_tool_system_requirements):Java based
Platform(s) (field_tool_platform):used under MS Windows, Linux
Character encoding of output data (field_tool_char_encoding_output):Unicode (UTF-8)
Documentation link (field_tool_document_link):not available
Approach (field_tool_aproach):cascaded regular grammars (finite-state)
Open source code (field_tool_open_source_code):no
Language(s) of output data (field_tool_output_language):-- any --
Character encoding of input data (field_tool_char_encoding):Unicode (UTF-8)
Version (field_tool_version):1.0 |
branding |
LRT + Open Submissions |
dc.coverage.placeName |
Bulgaria |
files.size |
0 |
files.count |
0 |
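
The description above mentions a cascaded regular grammar that assigns token categories over Latin and Cyrillic text. As a rough illustration of that general idea only, here is a minimal Python sketch. It is not the CLaRK grammar and not the tool's Java implementation: the category names (CYRILLIC, LATIN, NUMBER, SPACE, PUNCT, OTHER, DECIMAL), the character ranges, and the single second-stage rule are all assumptions chosen for the example.

```python
import re

# Stage 1: primitive token categories, tried in order.
# All names and ranges here are hypothetical, not CLaRK's own categories.
PRIMITIVES = [
    ("CYRILLIC", r"[\u0400-\u04FF]+"),
    ("LATIN", r"[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F]+"),
    ("NUMBER", r"[0-9]+"),
    ("SPACE", r"\s+"),
    ("PUNCT", r"[^\w\s]"),
    ("OTHER", r"."),  # catch-all so that no character is silently dropped
]
STAGE1 = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in PRIMITIVES))


def stage1(text):
    """Split text into (category, string) pairs of primitive tokens."""
    return [(m.lastgroup, m.group()) for m in STAGE1.finditer(text)]


def stage2(tokens):
    """Second cascade level: merge NUMBER, '.' or ',', NUMBER into DECIMAL."""
    out, i = [], 0
    while i < len(tokens):
        window = tokens[i:i + 3]
        if (len(window) == 3
                and window[0][0] == "NUMBER"
                and window[1] in (("PUNCT", "."), ("PUNCT", ","))
                and window[2][0] == "NUMBER"):
            out.append(("DECIMAL", "".join(s for _, s in window)))
            i += 3
        else:
            out.append(tokens[i])
            i += 1
    return out


def tokenize(text):
    """Run both stages and drop whitespace tokens from the result."""
    return [t for t in stage2(stage1(text)) if t[0] != "SPACE"]


if __name__ == "__main__":
    for category, string in tokenize("Цена 3.14 EUR."):
        print(category, string)
```

Each stage consumes the output of the previous one, so a new token category can, in this sketch, be added either as another primitive pattern or as another merge rule in a later stage, without touching the earlier stages.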