Czech models for MorphoDiTa, providing morphological analysis, morphological generation and part-of-speech tagging.
The morphological dictionary is created from MorfFlex CZ 160310 and the PoS tagger is trained on Prague Dependency Treebank 3.0 (PDT). and This work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education of the Czech Republic (project LM2010013).
The Czech morphologic system was devised by Jan Hajič.
The MorfFlex CZ dictionary was created by Jan Hajič and Jaroslava Hlaváčová.
The morphologic guesser research was supported by the projects 1ET101120503 and 1ET101120413 of Academy of Sciences of the Czech Republic and 100008/2008 of Charles University Grant Agency. The research was performed by Jan Hajič, Jaroslava Hlaváčová and David Kolovratník.
The tagger algorithm and feature set research was supported by the projects MSM0021620838 and LC536 of Ministry of Education, Youth and Sports of the Czech Republic, GA405/09/0278 of the Grant Agency of the Czech Republic and 1ET101120503 of Academy of Sciences of the Czech Republic. The research was performed by Drahomíra "johanka" Spoustová, Jan Hajič, Jan Raab and Miroslav Spousta.
The tagger is trained on morphological layer of Prague Dependency Treebank PDT 2.5, which was supported by the projects LM2010013, LC536, LN00A063 and MSM0021620838 of Ministry of Education, Youth and Sports of the Czech Republic, and developed by Martin Buben, Jan Hajič, Jiří Hana, Hana Hanová, Barbora Hladká, Emil Jeřábek, Lenka Kebortová, Kristýna Kupková, Pavel Květoň, Jiří Mírovský, Andrea Pfimpfrová, Jan Štěpánek and Daniel Zeman.
The Czech models for Korektor 2 created by Michal Richter, 02 Feb 2013. The models can either perform spellchecking and grammarchecking, or only generate diacritical marks. and This work was created by Michal Richter as an extension of his diploma thesis Advanced Czech Spellchecker. The models utilize MorfFlex CZ dictionary (http://hdl.handle.net/11858/00-097C-0000-0015-A780-9) created by Jan Hajič and Jaroslava Hlaváčová.
The presented Czech Named Entity Corpus 1.0 is the first publicly available corpus providing a large body of manually annotated named entities in Czech sentences, including a fine-grained classification. and 1ET101120503 (Integrace jazykových zdrojů za účelem extrakce informací z přirozených textů)
Czech Named Entity Corpus 1.1 fixes some issues of the Czech Named Entity Corpus 1.0: misannotated entities are fixed, all formats contain the same data, tmt format is replaced with treex format, all formats contain splitting into training, development and testing portion of the data. and SVV 267 314 (Teoretické základy informatiky a výpočetní lingvistiky), LM2010013 (LINDAT-CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat), GPP406/12/P175 (Vybrané derivační vztahy pro automatické zpracování češtiny), PRVOUK (PRVOUK)
Czech Named Entity Corpus 2.0 is a corpus of 8993 Czech sentences with manually annotated 35220 Czech named entities, classified according to a two-level hierarchy of 46 named entities. and SVV 267 314 (Teoretické základy informatiky a výpočetní lingvistiky), LM2010013 (LINDAT-CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat), GPP406/12/P175 (Vybrané derivační vztahy pro automatické zpracování češtiny), PRVOUK (PRVOUK)
The corpus contains Czech expressive speech recorded using scenario-based approach by a professional female speaker. The scenario was created on the basis of previously recorded natural dialogues between a computer and seniors. and European Commission Sixth Framework Programme
Information Society Technologies Integrated Project IST-34434
Czech subjectivity lexicon, i.e. a list of subjectivity clues for sentiment analysis in Czech. The list contains 4626 evaluative items (1672 positive and 2954 negative) together with their part of speech tags, polarity orientation and source information.
The core of the Czech subjectivity lexicon has been gained by automatic translation of a freely available English subjectivity lexicon downloaded from http://www.cs.pitt.edu/mpqa/subj_lexicon.html. For translating the data into Czech, we used parallel corpus CzEng 1.0 containing 15 million parallel sentences (233 million English and 206 million Czech tokens) from seven different types of sources automatically annotated at surface and deep layers of syntactic representation. Afterwards, the lexicon has been manually refined by an experienced annotator. and The work on this project has been supported by the GAUK 3537/2011 grant and by SVV project number 267 314.
The corpus contains video files of Czech Television News Broadcasts and JSON files with annotations of faces that appear in the broadcasts. The annotations are composed of frames in which a face is seen, name of the person whose face is seen, gender of the person (male/female), and the image region containing the face. The intended use of the corpus is to train models of faces for face detection, face identification, face verification, and face tracking. For convinience two different JSON files are provided. They contain the same data, but in different arrangements. One file has the identity of the person on the top, the other has the object ID on the top, where the object is a facetrack. A demo python skript is available for showing how to access the data.
The EBUContentGenre is a thesaurus containing the hierarchical description of various genres utilized in the TV broadcasting industry. This thesaurus is a part of a complex metadata specification called EBUCore intended for multifaceted description of audiovisual content. EBUCore (http://tech.ebu.ch/docs/tech/tech3293v1_3.pdf) is a set of descriptive and technical metadata based on the Dublin Core and adapted to media. EBUCore is the flagship metadata specification of European Broadcasting Union, the largest professional association of broadcasters around the world. It is developed and maintained by EBU's Technical Department (http://tech.ebu.ch). The translated thesaurus can be used for effective cataloguing of (mostly TV) audiovisual content and consequent development of systems for automatic cataloguing (topic/genre detection). and Technology Agency of the Czech Republic, project No. TA01011264
A slightly modified version of the Czech Wordnet. This is the version used to annotate "The Lexico-Semantic Annotation of PDT using Czech WordNet": http://hdl.handle.net/11858/00-097C-0000-0001-487A-4
The Czech WordNet was developed by the Centre of Natural Language Processing at the Faculty of Informatics, Masaryk University, Czech Republic.
The Czech WordNet captures nouns, verbs, adjectives, and partly adverbs, and contains 23,094 word senses (synsets). 203 of these were created or modified by UFAL during correction of annotations. This version of WordNet was used to annotate word senses in PDT: http://hdl.handle.net/11858/00-097C-0000-0001-487A-4
A more recent version of Czech WordNet is distributed by ELRA: http://catalog.elra.info/product_info.php?products_id=1089 and 1ET201120505, LM2010013