The tool uses a morphological lexicon of Bulgarian (100,000 lemmas) compiled as a finite-state automaton in the CLaRK System. It requires the text to be tokenized first and is applied to each token. It also includes guessers for unknown words and gazetteers for Named Entities. If the corresponding resources are available for another language, the tool can be tuned to that language.
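A minimal sketch of this lookup flow in Java follows; the two-entry map, the tags and the suffix heuristic are illustrative stand-ins for the actual finite-state lexicon and guesser modules, not the real resources:

    import java.util.Map;

    public class MorphoLookupSketch {
        // Toy stand-in for the 100,000-lemma lexicon (the real one is a
        // finite-state automaton compiled in the CLaRK System).
        static final Map<String, String> LEXICON = Map.of(
                "kotka", "Ncfs",    // toy tag: noun, common, feminine, singular
                "cheta", "Vpitf");  // toy tag: verb

        // The analyser is applied to each token of the pre-tokenised text.
        static String analyse(String token) {
            String tag = LEXICON.get(token);
            if (tag != null) return tag;
            // Guesser for unknown words: a crude suffix heuristic standing in
            // for the real guesser and gazetteer modules.
            return token.endsWith("a") ? "N-guess" : "UNKNOWN";
        }

        public static void main(String[] args) {
            for (String token : new String[]{"kotka", "masa", "xyz"}) {
                System.out.println(token + " -> " + analyse(token));
            }
        }
    }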
This is a hybrid system: rules, then a neural network, then rules again. First, rules for the sure cases are applied; then a neural network disambiguator is applied; finally, rules repair the most frequent errors of the neural network. The rules are implemented as constraints in the CLaRK System. The neural network is an additional module implemented in Java and invoked from CLaRK. The tool requires morphologically annotated input.
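The three-stage flow can be sketched as follows (illustrative tags and an invented sure-case rule; the real stages are CLaRK constraints and the Java neural module):

    import java.util.*;

    public class HybridTaggerSketch {
        public static void main(String[] args) {
            // Token -> candidate tags from morphological analysis (toy tags).
            Map<String, Set<String>> candidates = new LinkedHashMap<>();
            candidates.put("si", new LinkedHashSet<>(List.of("Vxitf", "Ppxts")));
            candidates.put("kupil", new LinkedHashSet<>(List.of("Vpptcv")));

            // Stage 1: rules for the sure cases, e.g. keep only the auxiliary
            // reading of "si" before a participle (an invented rule).
            candidates.get("si").retainAll(Set.of("Vxitf"));

            // Stage 2: the neural disambiguator picks one tag per token;
            // stubbed here as "take the first remaining candidate".
            Map<String, String> tagged = new LinkedHashMap<>();
            candidates.forEach((tok, tags) -> tagged.put(tok, tags.iterator().next()));

            // Stage 3: repair rules fix the most frequent NN errors (no-op here).
            System.out.println(tagged); // {si=Vxitf, kupil=Vpptcv}
        }
    }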
The tokenizer covers all languages that use the Latin-1, Latin-2, Latin-3 and Cyrillic tables of Unicode, and can be extended to cover other Unicode tables if necessary. It is implemented as a cascaded regular grammar in CLaRK. It recognizes over 60 token categories and is easy to adapt to new ones.
The CLaRK System incorporates several technologies:
- XML technology;
- Unicode;
- cascaded regular grammars;
- constraints over XML documents.
On the basis of these technologies the following tools are implemented: XML Editor, Unicode Tokeniser, Sorting tool, Removing and Extracting tool, Concordancer, XSLT tool, Cascaded Regular Grammar tool, etc.
1 Unicode tokenization
In order to make it possible to impose constraints over textual nodes and to segment them in a meaningful way, the CLaRK System supports a user-defined hierarchy of tokenisers. At the most basic level the user can define a tokeniser in terms of a set of token types. In such a basic tokeniser each token type is defined by a set of Unicode symbols. On top of these basic tokenisers, the user can define further tokenisers whose token types are defined as regular expressions over the tokens of some other tokeniser, the so-called parent tokeniser.
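The following sketch illustrates the idea under simplifying assumptions: the basic token types are derived from Java character classes rather than user-defined Unicode sets, and the derived tokeniser hard-codes a single pattern (DIG "." DIG) instead of a full regular expression over parent tokens:

    import java.util.*;

    public class TokeniserSketch {
        // Basic tokeniser: each token type is defined by a set of Unicode
        // characters; approximated here by built-in character classes.
        static String basicType(int cp) {
            if (Character.isLetter(cp))     return "LAT";
            if (Character.isDigit(cp))      return "DIG";
            if (Character.isWhitespace(cp)) return "WS";
            return "PUNCT";
        }

        static List<String[]> basicTokenise(String text) {
            List<String[]> tokens = new ArrayList<>();
            int i = 0;
            while (i < text.length()) {
                String type = basicType(text.codePointAt(i));
                int j = i;
                while (j < text.length() && basicType(text.codePointAt(j)).equals(type)) {
                    j += Character.charCount(text.codePointAt(j));
                }
                tokens.add(new String[]{type, text.substring(i, j)});
                i = j;
            }
            return tokens;
        }

        // Derived tokeniser: a new token type ("NUM") defined over the tokens
        // of the parent tokeniser, matching DIG "." DIG (e.g. "3.14").
        static List<String[]> derivedTokenise(List<String[]> parent) {
            List<String[]> out = new ArrayList<>();
            for (int i = 0; i < parent.size(); i++) {
                if (i + 2 < parent.size()
                        && parent.get(i)[0].equals("DIG")
                        && parent.get(i + 1)[1].equals(".")
                        && parent.get(i + 2)[0].equals("DIG")) {
                    out.add(new String[]{"NUM",
                            parent.get(i)[1] + "." + parent.get(i + 2)[1]});
                    i += 2;
                } else {
                    out.add(parent.get(i));
                }
            }
            return out;
        }

        public static void main(String[] args) {
            for (String[] t : derivedTokenise(basicTokenise("pi = 3.14"))) {
                System.out.println(t[0] + "\t'" + t[1] + "'");
            }
        }
    }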
2 Regular Grammars
Regular grammars are the basic mechanism for linguistic processing of the content of an XML document within the system. The regular grammar processor applies a set of rules over the content of selected elements in the document and incorporates the categories of the rules back into the document as XML mark-up. Before the grammar rules are applied, the content is processed in the following way: textual nodes are tokenized with an appropriate tokeniser, and element nodes are textualized on the basis of XPath expressions that determine the important information about each element. A recognized word is substituted by new XML mark-up, which may or may not contain the word.
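As an illustration (not the actual CLaRK rule syntax), the following sketch applies a single rule over the textual content of an element and wraps each match in new mark-up that keeps the recognized word:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class GrammarRuleSketch {
        public static void main(String[] args) {
            String content = "Sofia is the capital of Bulgaria .";
            // Rule: an upper-case-initial word is recognized and marked up.
            Pattern rule = Pattern.compile("\\b\\p{Lu}\\p{L}*\\b");
            Matcher m = rule.matcher(content);
            StringBuffer sb = new StringBuffer();
            while (m.find()) {
                // The recognized word is substituted by new mark-up that
                // keeps the word inside the element.
                m.appendReplacement(sb,
                        "<NE>" + Matcher.quoteReplacement(m.group()) + "</NE>");
            }
            m.appendTail(sb);
            System.out.println(sb);
            // <NE>Sofia</NE> is the capital of <NE>Bulgaria</NE> .
        }
    }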
3 Constraints
The constraints implemented in the CLaRK System are generally based on the XPath language. We use XPath expressions to select data within one or several XML documents and then evaluate predicates over that data. There are two modes of using a constraint. In the first mode the constraint is used for a validity check, similar to the validity check based on a DTD or an XML schema. In the second mode the constraint is used to support changing the document so that it satisfies the constraint. Three types of constraints are implemented in the system: regular expression constraints, number restriction constraints, and value restriction constraints.
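A sketch of the validity-check mode using the standard Java XPath API follows; the document and the constraint (every <w> element must carry an ana attribute, a value-restriction-style check) are invented for illustration:

    import java.io.ByteArrayInputStream;
    import java.nio.charset.StandardCharsets;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;

    public class ConstraintSketch {
        public static void main(String[] args) throws Exception {
            String xml = "<s><w ana=\"Npns\">dete</w><w>teche</w></s>";
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Validity-check mode: the predicate must hold for the document
            // to be valid; here, every <w> element must carry an ana attribute.
            Boolean valid = (Boolean) xpath.evaluate(
                    "count(//w[not(@ana)]) = 0", doc, XPathConstants.BOOLEAN);
            System.out.println(valid ? "constraint satisfied"
                                     : "constraint violated: <w> without @ana");
        }
    }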
4 Macro Language
In the CLaRK System the tools support a mechanism for describing their settings. On the basis of these descriptions (called queries) a tool can be applied simply by pointing to a certain description record. Each query contains the states of all settings and options of the corresponding tool. Given such queries, a special tool combines and applies them in groups (macros). During application the queries are executed successively, and the result of one application is the input for the next one.
For better control over the process of applying several queries as one, we introduce several conditional operators. These operators can determine the next query to apply depending on certain conditions. When the condition of such an operator is satisfied, execution continues from a location defined in the operator; the mechanism for addressing queries is based on user-defined labels. When the condition is not satisfied, the operator is ignored and the process continues from the position following it. In this way constructions like IF-THEN-ELSE and WHILE-DO can easily be expressed.
The system supports five types of control operators:
IF (XPath): the condition is an XPath expression evaluated on the current working document; if the result is a non-empty node-set, a non-empty string, a positive number or a true boolean value, the condition is satisfied;
IF NOT (XPath): the same kind of condition as the previous one, but negated;
IF CHANGED: the condition is satisfied if the preceding operation has changed the current working document or has produced a non-empty result document (depending on the operation);
IF NOT CHANGED: the condition is satisfied if the preceding operation either did not change the working document or did not produce a non-empty result;
GOTO: unconditionally changes the execution position.
Each macro defined in the system can have its own query and can be incorporated in another macro. In this way a limited form of subroutines can be implemented.
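The control flow described above can be sketched as a small interpreter; the step types and the example macro are hypothetical, not CLaRK's actual query format:

    import java.util.*;
    import java.util.function.BooleanSupplier;

    public class MacroSketch {
        interface Step {}
        record Query(String name, Runnable action) implements Step {}
        record If(BooleanSupplier condition, String target) implements Step {}
        record Goto(String target) implements Step {}
        record Label(String name) implements Step {}

        static void run(List<Step> macro) {
            // Collect the user-defined labels that conditional jumps address.
            Map<String, Integer> labels = new HashMap<>();
            for (int i = 0; i < macro.size(); i++) {
                if (macro.get(i) instanceof Label l) labels.put(l.name(), i);
            }
            int pc = 0;
            while (pc < macro.size()) {
                Step s = macro.get(pc);
                if (s instanceof Query q) {
                    q.action().run();                 // queries execute in sequence
                } else if (s instanceof If c && c.condition().getAsBoolean()) {
                    pc = labels.get(c.target());      // condition satisfied: jump
                    continue;
                } else if (s instanceof Goto g) {
                    pc = labels.get(g.target());      // unconditional jump
                    continue;
                }
                pc++;  // unsatisfied operators and labels are simply skipped
            }
        }

        public static void main(String[] args) {
            int[] counter = {0};
            run(List.of(
                new Label("loop"),
                new Query("tokenise", () -> counter[0]++),
                new If(() -> counter[0] < 3, "loop"),   // WHILE-DO via IF + label
                new Query("report",
                    () -> System.out.println("ran " + counter[0] + " times"))
            ));
        }
    }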
The new version of CLaRK will support server applications and calls to/from external programs.
This corpus is part of Deliverable 5.5 of the European Commission project QTLeap FP7-ICT-2013.4.1-610516 (http://qtleap.eu).
The texts are sentences from the Europarl parallel corpus (Koehn, 2005). We selected the monolingual sentences from the parallel corpora for the following pairs: Bulgarian-English, Czech-English, Portuguese-English and Spanish-English. The English corpus consists of the English side of the Spanish-English corpus. Since Basque is not covered by Europarl, the collection additionally contains the Basque and English sides of the GNOME corpus.
The texts have been automatically annotated with NLP tools, including Word Sense Disambiguation, Named Entity Disambiguation and Coreference Resolution. Please see deliverable D5.6 at http://qtleap.eu/deliverables for more information.
This corpus is part of Deliverable 5.5 of the European Commission project QTLeap FP7-ICT-2013.4.1-610516 (http://qtleap.eu).
The texts are Q&A interactions from the real-user scenario (batches 1 and 2). The interactions in this corpus are available in Basque, Bulgarian, Czech, English, Portuguese and Spanish.
The texts have been automatically annotated with NLP tools, including Word Sense Disambiguation, Named Entity Disambiguation and Coreference Resolution. Please see deliverable D5.6 at http://qtleap.eu/deliverables for more information.
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008). This is the second release of UD Treebanks, Version 1.1.