dc.contributor.author Rosa, Rudolf
dc.contributor.author Zeman, Daniel
dc.contributor.author Mareček, David
dc.contributor.author Žabokrtský, Zdeněk
dc.date.accessioned 2017-04-06T14:33:14Z
dc.date.available 2017-04-06T14:33:14Z
dc.date.issued 2017-01-28
dc.identifier.uri http://hdl.handle.net/11234/1-1970
dc.description Tools and scripts used to create the cross-lingual parsing models submitted to VarDial 2017 shared task (https://bitbucket.org/hy-crossNLP/vardial2017), as described in the linked paper. The trained UDPipe models themselves are published in a separate submission (https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1971).

For each source (SS, e.g. sl) and target (TT, e.g. hr) language, you need to add the following into this directory:
- treebanks (Universal Dependencies v1.4):
  SS-ud-train.conllu
  TT-ud-predPoS-dev.conllu
- parallel data (OpenSubtitles from Opus):
  OpenSubtitles2016.SS-TT.SS
  OpenSubtitles2016.SS-TT.TT
  !!! If they are originally called ...TT-SS... instead of ...SS-TT..., you need to symlink them (or move, or copy) !!!
- target tagging model:
  TT.tagger.udpipe
All of these can be obtained from https://bitbucket.org/hy-crossNLP/vardial2017

You also need to have:
- Bash
- Perl 5
- Python 3
- word2vec (https://code.google.com/archive/p/word2vec/); we used rev 41 from 15th Sep 2014
- udpipe (https://github.com/ufal/udpipe); we used commit 3e65d69 from 3rd Jan 2017
- Treex (https://github.com/ufal/treex); we used commit d27ee8a from 21st Dec 2016

The most basic setup is the sl-hr one (train_sl-hr.sh):
- normalization of deprels
- 1:1 word-alignment of parallel data with Monolingual Greedy Aligner
- simple word-by-word translation of source treebank
- pre-training of target word embeddings
- simplification of morpho feats (use only Case)
- and finally, training and evaluating the parser

Both da+sv-no (train_ds-no.sh) and cs-sk (train_cs-sk.sh) add some cross-tagging, which seems to be useful only in specific cases (see paper for details). Moreover, cs-sk also adds more morpho features, selecting those that seem to be very often shared in parallel data.

The whole pipeline takes tens of hours to run, and uses several GB of RAM, so make sure to use a powerful computer.
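
The input-file layout listed in the description can be checked before starting one of the training scripts. The following is a minimal sketch in Python and is not part of the published scripts: the function names `required_files`, `link_reversed_parallel_data` and `missing_files` are hypothetical, and only the file-name patterns are taken from the description. It assumes a single source language, so the da+sv-no setup would need one call per source.

```python
import os
from pathlib import Path


def required_files(ss, tt):
    """Return the input file names the description lists for source SS and target TT."""
    return [
        f"{ss}-ud-train.conllu",
        f"{tt}-ud-predPoS-dev.conllu",
        f"OpenSubtitles2016.{ss}-{tt}.{ss}",
        f"OpenSubtitles2016.{ss}-{tt}.{tt}",
        f"{tt}.tagger.udpipe",
    ]


def link_reversed_parallel_data(directory, ss, tt):
    """Symlink OpenSubtitles2016.TT-SS.* to the expected SS-TT names when needed."""
    directory = Path(directory)
    created = []
    for lang in (ss, tt):
        expected = directory / f"OpenSubtitles2016.{ss}-{tt}.{lang}"
        reversed_name = directory / f"OpenSubtitles2016.{tt}-{ss}.{lang}"
        if not expected.exists() and reversed_name.exists():
            # Relative link, so the directory can be moved as a whole.
            os.symlink(reversed_name.name, expected)
            created.append(expected.name)
    return created


def missing_files(directory, ss, tt):
    """List the required input files that are not present in the directory."""
    directory = Path(directory)
    return [name for name in required_files(ss, tt) if not (directory / name).exists()]
```

Calling `link_reversed_parallel_data(".", "sl", "hr")` followed by `missing_files(".", "sl", "hr")` reports what still has to be downloaded for the sl-hr setup.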
dc.language.iso ces
dc.language.iso slk
dc.language.iso slv
dc.language.iso hrv
dc.language.iso dan
dc.language.iso swe
dc.language.iso nor
dc.publisher Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.relation info:eu-repo/grantAgreement/EC/H2020/644402
dc.relation.isreferencedby http://web.science.mq.edu.au/~smalmasi/vardial4/pdf/VarDial26.pdf
dc.rights GNU General Public License 2 or later (GPL-2.0)
dc.rights.uri http://opensource.org/licenses/GPL-2.0
dc.subject parsing
dc.subject dependency parser
dc.subject universal dependencies
dc.subject cross-lingual parsing
dc.title Slavic Forest, Norwegian Wood (scripts)
dc.type toolService
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent true
metashare.ResourceInfo#ContentInfo.detailedType suiteOfTools
dc.rights.label PUB
has.files yes
branding LINDAT / CLARIAH-CZ
contact.person Rudolf Rosa rosa@ufal.mff.cuni.cz Charles University, UFAL
sponsor European Union EC/H2020/644402 HimL - Health in my Language euFunds info:eu-repo/grantAgreement/EC/H2020/644402
sponsor Ministerstvo školství, mládeže a tělovýchovy České republiky LM2015071 LINDAT/CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat nationalFunds
sponsor Grantová agentura Univerzity Karlovy v Praze GAUK 15723/2014 Modelování závislostní syntaxe napříč jazyky nationalFunds
sponsor Univerzita Karlova (mimo GAUK) SVV 260 333 Specifický vysokoškolský výzkum nationalFunds
sponsor Grantová agentura České republiky 15-10472S Morphologically and Syntactically Annotated Corpora of Many Languages nationalFunds
files.size 24254
files.count 11


Files in this item

License category: Publicly Available

License: GNU General Public License 2 or later (GPL-2.0)

Name: train_sl-hr.sh
Size: 1.57 KB
Format: Unknown
Description: The full training script for sl-hr
MD5: 948900d9e5c936d9ab497675d053beb6

Name: train_cs-sk.sh
Size: 1.89 KB
Format: Unknown
Description: The full training script for cs-sk
MD5: 6810a887f8bdfaf96df06d279452ce7d

Name: train_ds-no.sh
Size: 2.12 KB
Format: Unknown
Description: The full training script for da+sv-no
MD5: 41d4a5deb15b04d06827a0ee9953de18

Name: normalize.pl
Size: 11.1 KB
Format: Unknown
Description: Deprel normalization
MD5: 9211df21bda377f6d62681a48d7614cc

Name: monogreedy_align.sh
Size: 895 bytes
Format: Unknown
Description: Word alignment
MD5: 415cca16e21a9587ff4d596d1251906c

Name: trtable_src2tgt_feats.py
Size: 2.12 KB
Format: Unknown
Description: Translation table creation
MD5: 43e880128e2fc6c66bdcd3d5835a1d69

Name: translate_conll_src2tgt_feats.py
Size: 1.16 KB
Format: Unknown
Description: Treebank translation
MD5: 792e075d41a9c1889cd0470bbab0c842

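The description characterizes treebank translation as a simple word-by-word replacement. The sketch below illustrates that idea on a single CoNLL-U token line and is not the code of translate_conll_src2tgt_feats.py: the function name `translate_token_line`, the in-memory `table` dictionary and the example sl-hr word pair are assumptions made for illustration, and the real script may also use morphological features, as its name suggests.

```python
def translate_token_line(line, table):
    """Replace the FORM column (2nd field) of a CoNLL-U token line via a lookup table.

    Words missing from the table are kept in their source-language form.
    Comment lines, blank lines and malformed lines are returned unchanged.
    """
    stripped = line.rstrip("\n")
    fields = stripped.split("\t")
    if stripped.startswith("#") or len(fields) != 10:
        return stripped
    fields[1] = table.get(fields[1], fields[1])
    return "\t".join(fields)


# Hypothetical one-entry translation table (Slovene -> Croatian).
example_table = {"hiša": "kuća"}
example_line = "1\thiša\thiša\tNOUN\t_\tCase=Nom\t0\troot\t_\t_"
translated = translate_token_line(example_line, example_table)
```

Only the word form changes; the tree structure (HEAD and DEPREL columns) is carried over untouched, which is what makes the translated treebank usable for training a target-language parser.
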
Name: feats2FEAT.py
Size: 412 bytes
Format: Unknown
Description: Features simplification
MD5: 5089de1e63c1aa36cf284bb85600365c

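For the basic sl-hr setup, the description says the morphological features are simplified to Case only. A minimal sketch of such a reduction on the CoNLL-U FEATS column follows; it is an illustration under that assumption rather than the contents of feats2FEAT.py, and the function names `keep_only_case` and `simplify_conllu_line` are hypothetical.

```python
def keep_only_case(feats):
    """Reduce a CoNLL-U FEATS value to its Case feature, or '_' if there is none."""
    if feats == "_":
        return "_"
    kept = [feat for feat in feats.split("|") if feat.startswith("Case=")]
    return "|".join(kept) if kept else "_"


def simplify_conllu_line(line):
    """Apply keep_only_case to the FEATS column (6th field) of a token line.

    Comment lines, blank lines and malformed lines are returned unchanged.
    """
    stripped = line.rstrip("\n")
    fields = stripped.split("\t")
    if stripped.startswith("#") or len(fields) != 10:
        return stripped
    fields[5] = keep_only_case(fields[5])
    return "\t".join(fields)
```

For example, `keep_only_case("Case=Nom|Gender=Masc|Number=Sing")` yields `Case=Nom`, while a token without a Case feature ends up with the empty value `_`.
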
Name: feats2FEAT2xpos.py
Size: 412 bytes
Format: Unknown
Description: Features simplification and moving
MD5: 9b3f338bf5dc7b822d50e4deaf93f395

Name: get_shared_features.pl
Size: 973 bytes
Format: Unknown
Description: Find crosslingually shared features
MD5: dde865ce7b96efabfbabce34d573b3d4

Name: prune_features.pl
Size: 1.09 KB
Format: Unknown
Description: Keep crosslingually shared features
MD5: 9674593e9bd64947bb5ee3a1fc5c5d95

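The MD5 values listed for each file can be used to confirm that a download is intact. The following sketch uses Python's standard `hashlib` module; the helper names `md5_of_file` and `verify_md5` are hypothetical and not part of this item.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Compute the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected_md5):
    """Return True when the file's MD5 digest matches the listed checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For instance, `verify_md5("train_sl-hr.sh", "948900d9e5c936d9ab497675d053beb6")` should return `True` for an unmodified copy of the sl-hr training script.
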
