What's New
Author(s):
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
This item contains 3 files (779.37 MB).
Publicly Available
lexicalConceptualResource
LINDAT / CLARIAH-CZ
Author(s): Václava Kettnerová, Jiří Mírovský, Veronika Kolářová, Michal Olbrich
Description:
DeriVallex 1.0 is a valency lexicon of automatically generated valency frames of Czech noun and adjectival derivatives the valency of which exhibits systemic correspondences with the valency of their base words. It contains 10,220 derivatives corresponding to 17,288 lexical units (i.e., individual senses). In particular, DeriVallex describes 3,134 nouns corresponding to 5,089 lexical units and 7,086 adjectives corresponding to 12,199 lexical units. DeriVallex was created with the aim of providing information on the valency of nouns and adjectives, which is not sufficiently covered in existing lexical resources. Focusing on nominal and adjectival derivatives that exhibit systematic valency behavior in comparison with their base words, it captures the productive and systemic core of the Czech lexicon, thus laying the foundation for the further extension of current lexical resources. The following word-formation categories are covered: action nouns (e.g., dobytí města nepřáteli ‘conquering the city by enemies’), quality nouns (e.g., učitelova laskavost k dětem ‘the teacher’s kindness to children’), simultaneous action adjectives (e.g., lidé bojující proti bezpráví ‘people fighting against injustice’), anterior action adjectives (e.g., dluh narostlý na 400 milionů ‘a debt that has risen to 400 million’ and muži navrátivší se z války ‘men who have returned from the war’), passive action adjectives (e.g., úspory diktované Evropě konzervativní vládou ‘austerity measures dictated to Europe by a conservative government’), and potentiality adjectives (e.g., dužina oddělitelná od pecky ‘flesh separable from the pit’). In compiling the lexicon, data from the following lexical resources were used: NomVallex 2.6, VALLEX 4.5, and DeriNet 2.3. 
To meet the different needs of potential users, the lexicon is distributed (i) online in an HTML version, which provides a user-friendly interface for searching and filtering the data, and (ii) in this distribution in a machine-readable form, so that the data can be used in NLP applications.
Authors: Václava Kettnerová, Jiří Mírovský, Veronika Kolářová and Michal Olbrich
Acknowledgement: The creation of the DeriVallex lexicon has been supported by the LINDAT/CLARIAH-CZ Research Infrastructure (https://lindat.cz), which is supported by the Ministry of Education, Youth and Sports of the Czech Republic (Project No. LM2023062). The work also used data and tools provided by this project.
License: DeriVallex is publicly available under the Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA). Non-commercial use is conditional on appropriate citation: Kettnerová, Václava and Mírovský, Jiří and Kolářová, Veronika and Olbrich, Michal. 2026. DeriVallex 1.0. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL). http://hdl.handle.net/11234/1-6109.
This item contains 1 file (25.63 MB).
Publicly Available
Music notation
LINDAT / CLARIAH-CZ
Author(s):
Description:
The MusiCorpus v1.0 dataset provides 1,309 pages of historical sheet music, primarily handwritten, with MusicXML transcriptions and symbol annotations, for training and evaluating Optical Music Recognition (OMR) systems in realistic conditions. It is the largest dataset of handwritten music to date and the first dataset containing a realistic representative sample of musical document collections from memory institutions. A large amount of musical heritage has been digitised by archival institutions. Despite progress in deep learning methods, however, the field of OMR has struggled to make this music machine-readable and therefore findable, because no datasets for training systems in realistic conditions were available. MusiCorpus is suitable for training and evaluating both end-to-end and object detection-based OMR systems, and for comparing their performance.
This item contains 1 file (14.06 GB).
Publicly Available
Most Viewed Items - Last Month
Author(s):
Description:
Tokenizer, POS Tagger, Lemmatizer and Parser models for 94 treebanks of 61 languages of Universal Dependencies 2.5 Treebanks, created solely using UD 2.5 data (http://hdl.handle.net/11234/1-3105). The model documentation, including performance figures, can be found at http://ufal.mff.cuni.cz/udpipe/models#universal_dependencies_25_models . To use these models, you need the UDPipe binary, version 1.2 or later, which you can download from http://ufal.mff.cuni.cz/udpipe . In addition to the models themselves, all additional data and the hyperparameter values used for training are available in the second archive, allowing reproducible training.
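UDPipe writes its analyses in the CoNLL-U format by default, the same format in which Universal Dependencies treebanks are distributed. The following is a minimal sketch of reading such output in Python using only the standard library. It assumes the usual ten tab-separated columns per token line; the function name `read_conllu` and the sample sentence are hypothetical, written here for illustration and not taken from any treebank or from UDPipe itself.

```python
# Minimal CoNLL-U reader: a sketch, not a full implementation of the format.
# Column names follow the CoNLL-U convention (lowercased here for dict keys).
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(text):
    """Yield one sentence at a time as a list of token dicts."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines carry sentence-level metadata; skipped in this sketch.
            continue
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(
                f"expected {len(FIELDS)} columns, got {len(columns)}")
        sentence.append(dict(zip(FIELDS, columns)))
    # Emit a trailing sentence if the input lacks a final blank line.
    if sentence:
        yield sentence


# Hypothetical sample, hand-written for illustration only.
SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

for sent in read_conllu(SAMPLE):
    for tok in sent:
        print(tok["form"], tok["upos"], tok["head"], tok["deprel"])
```

The sketch keeps every column as a string and does not treat multiword-token ranges or empty nodes specially; a dedicated CoNLL-U library may be a better fit for anything beyond quick inspection.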
This item contains 96 files (2.61 GB).
Publicly Available
Author(s):
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
This item contains 3 files (740.61 MB).
Publicly Available
Author(s):
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
This item contains 3 files (765.09 MB).
Publicly Available