What's New
Author(s):
Description:
The goal of the Paraphrase Identification (PI) task is to determine whether two sentences have the same meaning. The repository contains two PI datasets, namely paws (https://huggingface.co/datasets/paws) and quora (https://huggingface.co/datasets/quora). Each dataset comes in two versions: the original English version and a Czech translation that we added, produced with CUBBITT, the Charles University Block-Backtranslation-Improved Transformer Translation model (https://lindat.mff.cuni.cz/services/translation/). The record also includes target labels for the Czech datasets; note, however, that they may no longer be correct for the Czech translation because of errors made by the translation model. The licence of this record (CC BY-SA) applies to the translated part of the dataset. For the original English datasets, follow their respective licence descriptions.
This item contains 2 files (59.31 MB).
Publicly Available
Author(s):
Description:
The goal of the Natural Language Inference (NLI) task is to determine whether a "hypothesis" is true (entailment), false (contradiction), or undetermined (neutral) given a "premise". The repository contains three NLI datasets, namely snli (https://huggingface.co/datasets/snli), multi_nli (https://huggingface.co/datasets/multi_nli) and qnli (https://huggingface.co/datasets/glue/viewer/qnli/train). Each dataset comes in two versions: the original English version and a Czech translation that we added, produced with CUBBITT, the Charles University Block-Backtranslation-Improved Transformer Translation model (https://lindat.mff.cuni.cz/services/translation/). The record also includes target labels for the Czech datasets; note, however, that they may no longer be correct for the Czech translation because of errors made by the translation model. The licence of this record (CC BY-SA) applies to the translated part of the dataset. For the original English datasets, follow their respective licence descriptions.
This item contains 3 files (100.89 MB).
Publicly Available
Author(s):
Description:
Evaldio for Permanent Residency Permit is a service/tool that provides an automatic speech assessment of the oral part of the Czech language exam at the A2 level. Passing the exam is mandatory for obtaining a permanent residency permit in Czechia. The service/tool takes a recording of the exam as input and outputs the predicted relative score and the probability of passing the exam at the A2 level. In addition, it presents the user with the automatic transcription, diarization, and additional statistics.
This item contains 5 files (8.96 MB).
Publicly Available
Most Viewed Items - Last Month
Author(s):
Description:
Tokenizer, POS tagger, lemmatizer and parser models for 94 treebanks of 61 languages of Universal Dependencies 2.5 Treebanks, created solely using UD 2.5 data (http://hdl.handle.net/11234/1-3105). The model documentation, including performance, can be found at http://ufal.mff.cuni.cz/udpipe/models#universal_dependencies_25_models . To use these models, you need the UDPipe binary, version 1.2 or later, which you can download from http://ufal.mff.cuni.cz/udpipe . In addition to the models themselves, all additional data and the hyperparameter values used for training are available in the second archive, allowing reproducible training.
This item contains 96 files (2.61 GB).
Publicly Available
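As a rough illustration of how one of the UDPipe models described above might be applied, the following Python sketch builds and runs a UDPipe 1.x command line that tokenizes, tags and parses a plain-text file. It assumes that a `udpipe` binary (version 1.2 or later) is on the PATH; the model file name and `input.txt` are hypothetical placeholders, not names taken from this record.

```python
import shlex
import subprocess
from typing import List, Optional


def build_udpipe_command(
    model_path: str,
    input_path: Optional[str] = None,
    binary: str = "udpipe",
) -> List[str]:
    """Build a UDPipe 1.x command that tokenizes, tags and parses raw text.

    If input_path is None, UDPipe reads the text from standard input.
    """
    cmd = [binary, "--tokenize", "--tag", "--parse", model_path]
    if input_path is not None:
        cmd.append(input_path)
    return cmd


def run_udpipe(model_path: str, input_path: str, binary: str = "udpipe") -> str:
    """Run UDPipe and return its output (CoNLL-U by default) as a string.

    Raises subprocess.CalledProcessError if the binary exits with an error.
    """
    result = subprocess.run(
        build_udpipe_command(model_path, input_path, binary),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical model and input file names, for illustration only.
    command = build_udpipe_command("some-model-ud-2.5.udpipe", "input.txt")
    print(shlex.join(command))
```

Keeping the command construction separate from the call to the binary makes it possible to inspect or log the exact invocation before anything is executed.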
Author(s):
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
This item contains 3 files (740.61 MB).
Publicly Available
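Universal Dependencies treebanks are distributed in the CoNLL-U format, in which each token occupies one line of ten tab-separated fields. The following Python sketch shows one way such a sentence could be read into dictionaries; the sample sentence is a hand-written illustration and is not taken from the files in this record.

```python
from typing import Dict, List

# The ten CoNLL-U columns, in order.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def parse_conllu_sentence(block: str) -> List[Dict[str, str]]:
    """Parse one CoNLL-U sentence block into a list of token dictionaries.

    Comment lines (starting with '#') and blank lines are skipped.
    """
    tokens = []
    for line in block.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        tokens.append(dict(zip(CONLLU_FIELDS, columns)))
    return tokens


# Hand-written example sentence (an assumption for illustration only).
SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

if __name__ == "__main__":
    for token in parse_conllu_sentence(SAMPLE):
        print(token["form"], token["upos"], token["head"], token["deprel"])
```

The `upos` column holds the universal part-of-speech tag, `feats` the morphological features, and `head` with `deprel` the dependency relation to the governing token (a head of 0 marks the root).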
Author(s):
Description:
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
This item contains 3 files (598.92 MB).
Publicly Available