Treex::Web
Authors:
Martin Popel, Zdeněk Žabokrtský, Jan Ptáček, Petr Pajas
Description:
Treex (formerly TectoMT) is a highly modular NLP software system implemented in the Perl programming language under Linux. It is primarily aimed at machine translation and builds on the ideas and technology created during the Prague Dependency Treebank project. It is also intended to facilitate and accelerate the development of software for many other NLP tasks, mainly through the reusability of its numerous integrated processing modules (called blocks), which share uniform object-oriented interfaces.
 
 
Česílko
Authors:
Jan Hajič, Vladislav Kuboň, Petr Homola
Description:
The Česílko system was designed as a tool for fast and efficient translation from one source language into many mutually related target languages. As input, the system receives a high-quality human translation of the original (from any language) into Czech. It then translates the Czech input into a number of languages related to Czech. The system currently contains 5 language pairs, 4 of them only experimental, namely Czech into Polish, Lithuanian, Macedonian and Lower Sorbian. Unfortunately, the system cannot be tested on arbitrary texts for these language pairs because all of their dictionaries are small. The only working language pair (which is also usable outside the setup described above) is the fifth one, Czech to Slovak. Like other MT systems, Česílko requires human post-editing. The system has been under development since 1998.
 
 
PML-Tree Query
Authors:
Petr Pajas, Jan Štěpánek
Description:
PML-TQ is a powerful open-source search tool for all kinds of linguistically annotated tree-banks with several client interfaces and two search back-ends (one based on a SQL database and one based on Perl and the TrEd toolkit). The tool works natively with tree-banks encoded in the PML data format (conversion scripts are available for many established tree-bank formats).
 
 
PDT-Vallex
Authors:
Zdeňka Urešová, Jan Štěpánek, Jan Hajič, Jarmila Panevova, Marie Mikulová
Description:
The valency lexicon PDT-Vallex has been built in close connection with the annotation of the Prague Dependency Treebank project (PDT) and its successors (mainly the Prague Czech-English Dependency Treebank project, PCEDT). It contains over 11000 valency frames for more than 7000 verbs that occurred in the PDT or PCEDT. It is available in an electronically processable format (XML) together with the aforementioned treebanks (to be viewed and edited in TrEd, the main PDT/PCEDT annotation tool), and also in a more human-readable form that includes corpus examples (see the WEBSITE link below). The main feature of the lexicon is its linking to the annotated corpora: each occurrence of each verb is linked to the appropriate valency frame, with additional (generalized) information about its usage and surface morphosyntactic form alternatives.
 
 
EngVallex
Authors:
Silvie Cinková, Eva Fučíková, Jana Šindlerová, Jan Hajič
Description:
EngVallex is the English counterpart of the PDT-Vallex valency lexicon, using the same view of valency, valency frames and the description of the surface form of verbal arguments. EngVallex also contains links to PropBank and VerbNet, two existing English predicate-argument lexicons used, i.a., for the PropBank project. The EngVallex lexicon is fully linked to the English side of the PCEDT parallel treebank, which is in fact the PTB re-annotated in the Prague Dependency Treebank style of annotation. EngVallex is available in an XML format in our repository, and also in a searchable form (see the WEBSITE link below) with examples from the PCEDT.
 
 
CzEngVallex
Authors:
Zdeňka Urešová, Eva Fučíková, Jan Hajič, Jana Šindlerová
Description:
CzEngVallex is a bilingual valency lexicon of corresponding Czech and English verbs. It connects 20835 aligned valency frame pairs (verb senses) which are translations of each other, aligning their arguments as well. The CzEngVallex serves as a powerful, real-text-based database of frame-to-frame and subsequently argument-to-argument pairs and can be used for example for machine translation applications. It uses the data from the Prague Czech-English Dependency Treebank project (PCEDT 2.0) and it also takes advantage of two existing valency lexicons: PDT-Vallex for Czech and EngVallex for English, using the same view of valency (based on the Functional Generative Description theory). The CzEngVallex is available in an XML format in the LINDAT/CLARIN repository, and also in a searchable form (see the “More Apps” tab) interlinked with PDT-Vallex, EngVallex and with examples from the PCEDT.
 
 
Authors:
Jan Hajič
Description:
This interface connects you to a morphological server, which performs morphological analysis of Czech word forms. The results are presented in a tabular, color-coded form in which everything can be converted to a more detailed, human-readable description with a single click. The main features of the interface and of the morphological server are as follows. It is 100% compatible with the HMAnalyze.pl program for batch morphological processing, using the 50k version of the Czech morphological dictionary (including input text tokenization). It can run a tagger instead of a plain morphological analyzer; the tagger uses a more advanced morphological dictionary and statistically based rules for disambiguation. The tagger currently uses morphology version CZ11122a. It can display basic frequency information about word forms and lemmas, as computed from the large Czech National Corpus. The interface offers various encoding options, both for the input of word forms and for the output (display) of results, in case you have difficulty entering or displaying Czech characters (which are part of the Latin 2 character set). The interface allows passing "Content Filtering" options to the morphological analyzer, ensuring that the same set of filtering rules is applied as in the "batch" version; it is also 100% compatible with the HMAnalyze.pl filtering options. The interface contains links to various Help/Description pages. In addition, all results are presented as live links to scripts that produce and display a breakdown of the tags and lemma-related information in human-readable form.
 
 
MorphoDiTa
Authors:
Milan Straka, Jana Straková
Description:
MorphoDiTa (Morphological Dictionary and Tagger) is an open-source tool for morphological analysis of natural language texts. It performs morphological analysis, morphological generation, tagging and tokenization, and is distributed as a standalone tool or a library, along with trained linguistic models. For Czech, MorphoDiTa achieves state-of-the-art results with a throughput of around 10-200K words per second. MorphoDiTa is free software under the LGPL license. The linguistic models are free for non-commercial use and distributed under the CC BY-NC-SA license, although for some models the original data used to create the model may impose additional licensing conditions.
 
 
NameTag
Authors:
Milan Straka, Jana Straková
Description:
NameTag is an open-source tool for named entity recognition (NER). NameTag identifies proper names in text and classifies them into predefined categories, such as names of persons, locations and organizations. NameTag is distributed as a standalone tool or a library, along with trained linguistic models. For Czech, NameTag achieves state-of-the-art performance (Straková et al. 2013). NameTag is free software under the LGPL license. The linguistic models are free for non-commercial use and distributed under the CC BY-NC-SA license, although for some models the original data used to create the model may impose additional licensing conditions.
 
 
Authors:
Institute of Czech Language, Czech Academy of Sciences; NLP Centre at the Faculty of Informatics, Masaryk University
Description:
The ILRB was created by two cooperating teams, one from the Institute of Czech Language, Czech Academy of Sciences, and one from the NLP Centre at the Faculty of Informatics, Masaryk University (2004-2008). The tool consists of two sections: a wordlist section and a reference (explanatory) section. Comments and remarks are welcome and should be sent to poradna@ujc.cas.cz.
1. Wordlist section. It contains more than 60 000 dictionary entries and is based on the glossary of the School Rules of Czech Orthography, the Dictionary of the Literary Czech, and selected entries from the New Dictionary of Words of Foreign Origin and the Dictionary of Neologisms. The entries typically include the information that users ask about most frequently. Inflected forms of each word are also offered as tables, generated by the morphological analyzer ajka created at the Faculty of Informatics, MU. The dictionary part is linked to the explanatory part through hypertext links.
2. Reference section. It comprises explanations of the linguistic phenomena described in the Rules of Czech Orthography and in contemporary Czech grammars that users of the Linguistic Advisory Line at the Institute of Czech Language ask about frequently and repeatedly. The explanations deal with typical spelling problems and include appropriate recommendations. The ILRB is regularly updated and extended: new expressions are added and entries are made more precise.
 
 
ElixirFM
Authors:
Otakar Smrž
Description:
ElixirFM is a high-level implementation of Functional Arabic Morphology. ElixirFM can process words of Modern Written Arabic using four different modes. Here, you can learn how to use these modes for various purposes.
 
 
Authors:
Nino Peterek
Description:
The Dialogy.Org system allows users to search texts (transcripts), watch video recordings and view F0 acoustic waveforms. The system runs through a web-based interface, so no additional programs need to be installed on your computer. Flash Player is required to play the video recordings.
 
 
Korektor
Authors:
Michal Richter, Pavel Straňák, Milan Straka
Description:
Korektor is a statistical spell- and (occasional) grammar-checker. The spellchecker started as Michal Richter's diploma thesis, Advanced Czech Spellchecker, and is being developed further. There are two versions: a Unix command-line utility (tested on Debian, Ubuntu and OS X) and an OS X SpellServer with a System Service that integrates with native OS X GUI applications.
 
 
Authors:
Ondřej Bojar, Aleš Tamchyna, Ondřej Cifka
Description:
Moses is an implementation of the statistical (or data-driven) approach to machine translation (MT). This is the dominant approach in the field at the moment, and is employed by the online translation systems deployed by the likes of Google and Microsoft. Recently, our hybrid system Chimera, with Moses as the central component, performed better than Google Translate when translating from English to Czech.
 
 
Authors:
NLP Centre, Masaryk University (Brno), Institute of the Czech National Corpus (Prague), Institute of Formal and Applied Linguistics (Prague)
Description:
KonText is a basic web application for querying corpora available within the LINDAT/CLARIN project. It supports evaluating simple and complex queries, displaying their results as concordance lines, computing frequency distributions, calculating association measures for collocations, and further work with language data. This instance is a fork of the KonText application (developed by the Institute of the Czech National Corpus) that has been further extended by the Institute of Formal and Applied Linguistics to suit the needs of the LINDAT/CLARIN project.
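To make "concordance lines" and "frequency distribution" concrete, the following is a minimal sketch in plain Python. It is not KonText code and does not use its query language; the function names `concordance` and `next_word_frequencies` are hypothetical, and the corpus is assumed to be an already tokenized list of words.

```python
from collections import Counter


def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines as (left context, hit, right context)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def next_word_frequencies(tokens, keyword):
    """Count which tokens immediately follow the keyword."""
    counts = Counter()
    for i, tok in enumerate(tokens[:-1]):
        if tok.lower() == keyword.lower():
            counts[tokens[i + 1]] += 1
    return counts


tokens = "the cat sat on the mat and the dog sat too".split()
for left, hit, right in concordance(tokens, "sat", width=2):
    print(f"{left:>10} [{hit}] {right}")
print(next_word_frequencies(tokens, "the"))
```

A real corpus manager works on indexed corpora with positional attributes rather than a Python list, so this only illustrates the shape of the output.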
 
 
Parsito
Authors:
Milan Straka
Description:
Parsito is a fast open-source dependency parser written in C++. Parsito is based on greedy transition-based parsing; it has very high accuracy and achieves a throughput of 30K words per second. Parsito can be trained on any input data without feature engineering, because it uses an artificial neural network classifier. Trained models are available for all treebanks from the Universal Dependencies project (37 treebanks as of Dec 2015). Parsito is free software under the Mozilla Public License 2.0. The linguistic models are free for non-commercial use and distributed under the CC BY-NC-SA license, although for some models the original data used to create the model may impose additional licensing conditions.
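The idea of greedy transition-based parsing can be sketched in a few lines of Python. This is an illustration of the general arc-standard technique, not Parsito's implementation: the transition system Parsito actually uses is not stated here, and the `gold_oracle` function below is a hypothetical stand-in for the neural network classifier, choosing transitions by looking at a known gold tree (assumed projective).

```python
def greedy_parse(n_words, choose_transition):
    """Arc-standard parsing of words 1..n_words (0 is the artificial root).

    At every step a single transition is chosen and applied; there is no
    backtracking or beam. Returns the head of each word.
    """
    stack = [0]
    buffer = list(range(1, n_words + 1))
    heads = [None] * (n_words + 1)
    while buffer or len(stack) > 1:
        action = choose_transition(stack, buffer, heads)
        if action == "shift" and buffer:
            stack.append(buffer.pop(0))
        elif action == "left_arc" and len(stack) > 2:
            # The top of the stack becomes the head of the item below it.
            dependent = stack.pop(-2)
            heads[dependent] = stack[-1]
        elif action == "right_arc" and len(stack) > 1:
            # The item below the top becomes the head of the top.
            dependent = stack.pop()
            heads[dependent] = stack[-1]
        else:
            raise ValueError(f"transition {action!r} is not applicable")
    return heads[1:]


def gold_oracle(gold_heads):
    """Hypothetical classifier stand-in: picks transitions from a gold tree."""
    gold = [None] + list(gold_heads)

    def choose(stack, buffer, heads):
        if len(stack) > 2 and gold[stack[-2]] == stack[-1]:
            return "left_arc"
        if len(stack) > 1 and gold[stack[-1]] == stack[-2]:
            # Attach only once the top has collected all of its dependents.
            if all(gold[w] != stack[-1] for w in buffer):
                return "right_arc"
        return "shift"

    return choose


# "Dogs bark ." with heads: Dogs -> bark, bark -> root, "." -> bark
print(greedy_parse(3, gold_oracle([2, 0, 2])))
```

In a trained parser the oracle is replaced by a classifier that predicts the next transition from features of the stack and buffer, which is where the neural network comes in.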
 
 
Authors:
Jindřich Libovický
Description:
KER is a keyword extractor designed for scanned texts in Czech and English. It is based on the standard tf-idf algorithm, with the idf tables trained on texts from Wikipedia. To deal with data sparsity, texts are preprocessed by MorphoDiTa (Morphological Dictionary and Tagger).
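The tf-idf scoring described above can be sketched as follows. This is a minimal illustration and not KER's code: the function names `build_idf` and `top_keywords` are hypothetical, the input is assumed to be already lemmatized (the step MorphoDiTa performs), and the tiny background collection stands in for Wikipedia.

```python
import math
from collections import Counter


def build_idf(background_docs):
    """Inverse document frequency of each lemma in a background collection."""
    n_docs = len(background_docs)
    doc_freq = Counter()
    for doc in background_docs:
        doc_freq.update(set(doc))  # count each lemma once per document
    return {lemma: math.log(n_docs / df) for lemma, df in doc_freq.items()}


def top_keywords(lemmas, idf, default_idf, top_k=3):
    """Rank lemmas of one document by term frequency times idf."""
    if not lemmas:
        return []
    tf = Counter(lemmas)
    scores = {
        lemma: (count / len(lemmas)) * idf.get(lemma, default_idf)
        for lemma, count in tf.items()
    }
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda lemma: (-scores[lemma], lemma))[:top_k]


background = [
    ["the", "cat", "sit"],
    ["the", "dog", "run"],
    ["the", "parser", "run"],
]
idf = build_idf(background)
document = ["the", "parser", "read", "the", "parser", "run"]
# Lemmas unseen in the background get the idf of a lemma seen in one document.
print(top_keywords(document, idf, default_idf=math.log(len(background)), top_k=2))
```

Working on lemmas rather than surface forms is what reduces sparsity: all inflected forms of a word share one tf count and one idf entry.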
 
 
UDPipe
Authors:
Milan Straka, Jana Straková
Description:
UDPipe is a trainable pipeline for tokenization, tagging, lemmatization and dependency parsing of CoNLL-U files. UDPipe is language-agnostic and can be trained given only annotated data in the CoNLL-U format. Trained models are provided for nearly all UD treebanks. UDPipe is available as a binary, as a library for C++, Python, Perl, Java and C#, and as a web service. UDPipe is free software under the Mozilla Public License 2.0. The linguistic models are free for non-commercial use and distributed under the CC BY-NC-SA license, although for some models the original data used to create the model may impose additional licensing conditions.
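For readers unfamiliar with CoNLL-U, the sketch below reads the format in plain Python. It does not use the UDPipe library; `read_conllu` is a hypothetical helper, and the sample sentence is made up for illustration. It relies only on the basic layout of the format: one token per line with ten tab-separated columns, comment lines starting with `#`, and a blank line after each sentence.

```python
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")

SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)


def read_conllu(text):
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, skipped in this sketch
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}")
        current.append(dict(zip(FIELDS, columns)))
    if current:
        sentences.append(current)
    return sentences


for token in read_conllu(SAMPLE)[0]:
    print(token["form"], token["lemma"], token["upos"], token["head"], token["deprel"])
```

All values are kept as strings, because the `id` column may also hold ranges for multiword tokens and decimal ids for empty nodes.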