What's New
Author(s):
Description:
This dataset contains data for testing machine translation and topic classification in Piedmontese. It is based on FLORES+ (NLLB Team et al., 2024) and SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects (Adelani et al., EACL 2024).
This item contains 1 file (250.38 KB).
Publicly Available
lexicalConceptualResource
LINDAT / CLARIAH-CZ
Author(s):
Description:
The ontology provides a FAIR, interoperable vocabulary for grammatical error annotation and correction, integrating the English-focused ERRANT taxonomy with Czech-specific extensions from ERRANT-CZ and fine-grained categories derived from Czech proofreading and correction rules (Opravidlo). The ontology formalizes error types, subtypes, and correction operations in RDF, aligns linguistic properties with the LexInfo ontology, and supports multilingual grammatical error correction research, annotation interoperability, and data reuse.
This item contains 1 file (98.27 KB).
Publicly Available
Author(s):
Description:
The crac2026_empty_nodes_baseline is an XLM-RoBERTa-large–based multilingual model for the CRAC 2026 Empty Nodes Baseline system https://github.com/ufal/crac2026_empty_nodes_baseline for predicting empty nodes in input CoNLL-U files, trained on CorefUD 1.4 data. It was used to generate the baseline empty node predictions in the CRAC 2026 Shared Task on Multilingual Coreference Resolution https://ufal.mff.cuni.cz/corefud/crac26. The model is language agnostic, so in theory it can be used to predict empty nodes in any language covered by XLM-RoBERTa. Compared to last year's CRAC 2025 Empty Nodes Baseline https://github.com/ufal/crac2025_empty_nodes_baseline, this year's baseline predicts all available information for the empty nodes, i.e., the forms, lemmas, UPOS, XPOS, and FEATS columns in addition to the previously predicted word order and dependency relations. Instructions for running prediction, training, and intrinsic evaluation are all available in the CRAC 2026 Empty Nodes Baseline repository https://github.com/ufal/crac2026_empty_nodes_baseline.
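For orientation, the following is a minimal sketch of what an empty node looks like in a CoNLL-U file and how one might list such nodes. It relies only on the CoNLL-U convention that empty nodes carry a decimal ID such as 5.1. The function name find_empty_nodes and the sample sentence are illustrative assumptions; they are not taken from the baseline repository and do not reproduce the model's own code.

```python
# Illustrative sketch only: list the empty nodes found in CoNLL-U text.
# Assumption: empty nodes are the lines whose ID column has the decimal
# form "i.j" (e.g. "5.1"), as defined by the CoNLL-U format.

CONLLU_COLUMNS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)

# Hypothetical sample sentence with one empty node (5.1) for an elided verb.
_ROWS = [
    ("1", "Sue", "Sue", "PROPN", "_", "_", "2", "nsubj", "2:nsubj", "_"),
    ("2", "likes", "like", "VERB", "_", "_", "0", "root", "0:root", "_"),
    ("3", "coffee", "coffee", "NOUN", "_", "_", "2", "obj", "2:obj", "_"),
    ("4", "and", "and", "CCONJ", "_", "_", "5", "cc", "5.1:cc", "_"),
    ("5", "Bill", "Bill", "PROPN", "_", "_", "2", "conj", "5.1:nsubj", "_"),
    ("5.1", "likes", "like", "VERB", "_", "_", "_", "_", "2:conj", "_"),
    ("6", "tea", "tea", "NOUN", "_", "_", "5", "orphan", "5.1:obj", "_"),
]
SAMPLE = "# text = Sue likes coffee and Bill tea\n" + "\n".join(
    "\t".join(row) for row in _ROWS
) + "\n"


def find_empty_nodes(conllu_text):
    """Return one dict per empty node, keyed by CoNLL-U column name."""
    empty_nodes = []
    for line in conllu_text.splitlines():
        # Skip sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != len(CONLLU_COLUMNS):
            continue  # not a well-formed token line
        # Regular words have integer IDs, multiword tokens use "i-j",
        # and empty nodes use "i.j".
        if "." in fields[0]:
            empty_nodes.append(dict(zip(CONLLU_COLUMNS, fields)))
    return empty_nodes


if __name__ == "__main__":
    for node in find_empty_nodes(SAMPLE):
        print(node["id"], node["form"], node["lemma"], node["upos"], node["deps"])
```

In this sample, the HEAD and DEPREL columns of the empty node are left as underscores and its attachment is given in the DEPS column.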
This item contains 1 file (2.17 GB).
Publicly Available
Most Viewed Items - Last Month
Author(s):
Description:
Tokenizer, POS Tagger, Lemmatizer and Parser models for 94 treebanks of 61 languages of Universal Dependencies 2.5 Treebanks, created solely using UD 2.5 data (http://hdl.handle.net/11234/1-3105). The model documentation, including performance, can be found at http://ufal.mff.cuni.cz/udpipe/models#universal_dependencies_25_models . To use these models, you need the UDPipe binary, version 1.2 or later, which you can download from http://ufal.mff.cuni.cz/udpipe . In addition to the models themselves, all additional data and the hyperparameter values used for training are available in the second archive, allowing reproducible training.
This item contains 96 files (2.61 GB).
Publicly Available
Author(s):
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
This item contains 3 files (740.61 MB).
Publicly Available
Author(s):
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
This item contains 3 files (765.09 MB).
Publicly Available