An LMF-conformant XML file containing a comprehensive list of Arabic broken plurals (BPs). The file contains 12,249 singular words with their corresponding BPs.
HPSG-based annotation including constituent structure, dependency relations, named entities (classified as person, organisation, location or other names) and coreferential relations. The annotation is encoded in XML.
It uses a morphological lexicon of Bulgarian (100 000 lemmas) compiled as a finite-state automaton in the CLaRK System. It requires the text to be tokenized first, and it is then applied to each token. It also includes guessers for unknown words and named-entity gazetteers. If the corresponding resources are available for a different language, it can be tuned to that language.
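The per-token lookup described above can be illustrated with a minimal sketch. It assumes a plain dictionary in place of the finite-state automaton, and the names `LEXICON`, `GAZETTEER`, `guess` and `analyse`, the transliterated word forms, the tag labels and the lookup order (lexicon, then gazetteer, then guesser) are all hypothetical; none of them come from the CLaRK System itself.

```python
# Hypothetical toy lexicon: word form -> list of (lemma, tag) analyses.
# Forms are transliterated placeholders; tags are not BulTreeBank tags.
LEXICON = {
    "kniga": [("kniga", "NOUN")],
    "chete": [("cheta", "VERB")],
}

# Hypothetical named-entity gazetteer: word form -> entity class.
GAZETTEER = {"Sofia": "LOCATION"}


def guess(token):
    """Fallback guesser for forms found in neither resource (assumed heuristic)."""
    if token[:1].isupper():
        return [(token, "NAME-GUESS")]
    return [(token, "UNKNOWN")]


def analyse(tokens):
    """Apply the lookup to each token of an already tokenized text."""
    results = []
    for token in tokens:
        if token in LEXICON:
            analyses = LEXICON[token]
        elif token in GAZETTEER:
            analyses = [(token, GAZETTEER[token])]
        else:
            analyses = guess(token)
        results.append((token, analyses))
    return results
```

For example, `analyse(["kniga", "Sofia", "xyz"])` resolves the first token from the lexicon, the second from the gazetteer, and passes the third to the guesser.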
Written, synchronic, general, manually annotated; 1 000 000 tokens divided into three sets: 215 000 tokens used in the BulTreeBank HPSG Treebank (see below), a further 300 000 tokens checked a second time, and the remaining tokens (about 480 000) checked by the annotators. Morphosyntactic annotation with the BulTreeBank Tagset (http://www.bultreebank.org/TechRep/BTB-TR03.pdf), encoded in XML; the annotation is described in the technical reports of the BulTreeBank project (http://www.bultreebank.org/TechRep).
This is a hybrid system with three stages: rules, neural network, rules. First,
rules for the sure cases are applied, then a neural network
disambiguator is applied, and finally rules repair the most
frequent errors of the neural network. The rules are implemented
as constraints in the CLaRK System. The neural network is an additional
module implemented in Java. It is called CLaRK. It requires
morphologically annotated input.
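
The three-stage order described above (sure-case rules, then the disambiguator, then repair rules) can be sketched as follows. This is only an illustration under stated assumptions: the real rules are CLaRK constraints and the real disambiguator is a neural network in Java, whereas here a caller-supplied scoring function stands in for the network, and the token layout and all function names are hypothetical.

```python
def apply_sure_rules(tokens):
    """Stage 1: tag only the sure cases (here: tokens with a single candidate)."""
    for tok in tokens:
        if len(tok["candidates"]) == 1:
            tok["tag"] = tok["candidates"][0]
    return tokens


def apply_disambiguator(tokens, score):
    """Stage 2: stand-in for the neural network; picks the best-scoring
    candidate for every token that stage 1 left untagged."""
    for tok in tokens:
        if tok["tag"] is None:
            tok["tag"] = max(tok["candidates"],
                             key=lambda cand: score(tok["form"], cand))
    return tokens


def apply_repair_rules(tokens, repairs):
    """Stage 3: overwrite known frequent errors, given as a mapping
    (form, wrong_tag) -> corrected_tag."""
    for tok in tokens:
        fixed = repairs.get((tok["form"], tok["tag"]))
        if fixed is not None:
            tok["tag"] = fixed
    return tokens


def disambiguate(tokens, score, repairs):
    """Run the stages in the order rules, disambiguator, rules."""
    tokens = apply_sure_rules(tokens)
    tokens = apply_disambiguator(tokens, score)
    return apply_repair_rules(tokens, repairs)
```

Each token is assumed to be a dictionary with the keys `form`, `candidates` (the tags proposed by morphological annotation) and `tag` (initially `None`), which mirrors the requirement that the input be morphologically annotated.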