A richly annotated and genre-diversified language resource, The Prague Dependency Treebank – Consolidated 1.0 (PDT-C 1.0, or PDT-C in short in the sequel) is a consolidated release of the existing PDT-corpora of Czech data, uniformly annotated using the standard PDT scheme. PDT-corpora included in PDT-C: Prague Dependency Treebank (the original PDT contents, written newspaper and journal texts from three genres); Czech part of Prague Czech-English Dependency Treebank (translated financial texts, from English), Prague Dependency Treebank of Spoken Czech (spoken data, including audio and transcripts and multiple speech reconstruction annotation); PDT-Faust (user-generated texts). The difference from the separately published original treebanks can be briefly described as follows: it is published in one package, to allow easier data handling for all the datasets; the data is enhanced with a manual linguistic annotation at the morphological layer and new version of morphological dictionary is enclosed; a common valency lexicon for all four original parts is enclosed. Documentation provides two browsing and editing desktop tools (TrEd and MEd) and the corpus is also available online for searching using PML-TQ.
The Prague Dependency Treebank of Spoken Czech 2.0 (PDTSC 2.0) is a corpus of spoken language, consisting of 742,316 tokens and 73,835 sentences, representing 7,324 minutes (over 120 hours) of spontaneous dialogs. The dialogs have been recorded, transcribed and edited in several interlinked layers: audio recordings, automatic and manual transcripts and manually reconstructed text. These layers were part of the first version of the corpus (PDTSC 1.0). Version 2.0 is extended by an automatic dependency parser at the analytical and by the manual annotation of “deep” syntax at the tectogrammatical layer, which contains semantic roles and relations as well as annotation of coreference.
PDiT 2.0 is a new version of the Prague Discourse Treebank. It contains a complex annotation of discourse phenomena enriched by the annotation of secondary connectives.
The data contains the morphemic dictionary scanned in the PDF format. It is divided into 3 parts:
introductions.pdf - pp. 11-102
main_dictionary.pdf - pp. 113-506
appendices.pdf - pp. 509-645
The file contains all Czech verbs included in the Retrograde Morphemic Dictionary of Czech Language (Slavíčková Eleonora, Academia 1975).
The data was obtained by scanning a portion of the dictionary that contains words ending in -ci and -ti. Among them, there were 18 non-verbs, which were removed. Using OCR, the data was converted into the plain text format and the result was checked by two independent readers. However, if a user encounters a forgotten error, please report.
Trained models for UDPipe used to produce our final submission to the Vardial 2017 CLP shared task (https://bitbucket.org/hy-crossNLP/vardial2017). The SK model was trained on CS data, the HR model on SL data, and the SV model on a concatenation of DA and NO data. The scripts and commands used to create the models are part of separate submission (http://hdl.handle.net/11234/1-1970).
The models were trained with UDPipe version 3e65d69 from 3rd Jan 2017, obtained from
https://github.com/ufal/udpipe -- their functionality with newer or older versions of UDPipe is not guaranteed.
We list here the Bash command sequences that can be used to reproduce our results submitted to VarDial 2017. The input files must be in CoNLLU format. The models only use the form, UPOS, and Universal Features fields (SK only uses the form). You must have UDPipe installed. The feats2FEAT.py script, which prunes the universal features, is bundled with this submission.
SK -- tag and parse with the model:
udpipe --tag --parse sk-translex.v2.norm.feats07.w2v.trainonpred.udpipe sk-ud-predPoS-test.conllu
A slightly better after-deadline model (sk-translex.v2.norm.Case-feats07.w2v.trainonpred.udpipe), which we mention in the accompanying paper, is also included. It is applied in the same way (udpipe --tag --parse sk-translex.v2.norm.Case-feats07.w2v.trainonpred.udpipe sk-ud-predPoS-test.conllu).
HR -- prune the Features to keep only Case and parse with the model:
python3 feats2FEAT.py Case < hr-ud-predPoS-test.conllu | udpipe --parse hr-translex.v2.norm.Case.w2v.trainonpred.udpipe
NO -- put the UPOS annotation aside, tag Features with the model, merge with the left-aside UPOS annotation, and parse with the model (this hassle is because UDPipe cannot be told to keep UPOS and only change Features):
cut -f1-4 no-ud-predPoS-test.conllu > tmp
udpipe --tag no-translex.v2.norm.tgttagupos.srctagfeats.Case.w2v.udpipe no-ud-predPoS-test.conllu | cut -f5- | paste tmp - | sed 's/^\t$//' | udpipe --parse no-translex.v2.norm.tgttagupos.srctagfeats.Case.w2v.udpipe
Tools and scripts used to create the cross-lingual parsing models submitted to VarDial 2017 shared task (https://bitbucket.org/hy-crossNLP/vardial2017), as described in the linked paper. The trained UDPipe models themselves are published in a separate submission (https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1971).
For each source (SS, e.g. sl) and target (TT, e.g. hr) language,
you need to add the following into this directory:
- treebanks (Universal Dependencies v1.4):
SS-ud-train.conllu
TT-ud-predPoS-dev.conllu
- parallel data (OpenSubtitles from Opus):
OpenSubtitles2016.SS-TT.SS
OpenSubtitles2016.SS-TT.TT
!!! If they are originally called ...TT-SS... instead of ...SS-TT...,
you need to symlink them (or move, or copy) !!!
- target tagging model
TT.tagger.udpipe
All of these can be obtained from https://bitbucket.org/hy-crossNLP/vardial2017
You also need to have:
- Bash
- Perl 5
- Python 3
- word2vec (https://code.google.com/archive/p/word2vec/); we used rev 41 from 15th Sep 2014
- udpipe (https://github.com/ufal/udpipe); we used commit 3e65d69 from 3rd Jan 2017
- Treex (https://github.com/ufal/treex); we used commit d27ee8a from 21st Dec 2016
The most basic setup is the sl-hr one (train_sl-hr.sh):
- normalization of deprels
- 1:1 word-alignment of parallel data with Monolingual Greedy Aligner
- simple word-by-word translation of source treebank
- pre-training of target word embeddings
- simplification of morpho feats (use only Case)
- and finally, training and evaluating the parser
Both da+sv-no (train_ds-no.sh) and cs-sk (train_cs-sk.sh) add some cross-tagging, which seems to be useful only in
specific cases (see paper for details).
Moreover, cs-sk also adds more morpho features, selecting those that
seem to be very often shared in parallel data.
The whole pipeline takes tens of hours to run, and uses several GB of RAM, so make sure to use a powerful computer.
Slovak models for MorphoDiTa, providing morphological analysis, morphological generation and part-of-speech tagging.
The morphological dictionary is created from MorfFlex SK 170914 and the PoS tagger is trained on automatically translated Prague Dependency Treebank 3.0 (PDT).
Simple question answering database version 2.1 (SQAD_v2.1) created from Czech Wikipedia. Each record of SQAD consist of four files (in vertical form provided with lemmatization and POS tagging) and two metadata files.
Ministerstvo školství, mládeže a tělovýchovy České republiky@@LM2015071@@LINDAT/CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat@@nationalFunds@@✖[remove]73