Slavic Forest, Norwegian Wood (scripts)

Rosa, Rudolf; Zeman, Daniel; Mareček, David; Žabokrtský, Zdeněk

dc.contributor.author	Rosa, Rudolf
dc.contributor.author	Zeman, Daniel
dc.contributor.author	Mareček, David
dc.contributor.author	Žabokrtský, Zdeněk
dc.date.accessioned	2017-04-06T14:33:14Z
dc.date.available	2017-04-06T14:33:14Z
dc.date.issued	2017-01-28
dc.identifier.uri	http://hdl.handle.net/11234/1-1970
dc.description	Tools and scripts used to create the cross-lingual parsing models submitted to the VarDial 2017 shared task (https://bitbucket.org/hy-crossNLP/vardial2017), as described in the linked paper. The trained UDPipe models themselves are published in a separate submission (https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1971).

For each source (SS, e.g. sl) and target (TT, e.g. hr) language, you need to add the following into this directory:
- treebanks (Universal Dependencies v1.4):
    SS-ud-train.conllu
    TT-ud-predPoS-dev.conllu
- parallel data (OpenSubtitles from Opus):
    OpenSubtitles2016.SS-TT.SS
    OpenSubtitles2016.SS-TT.TT
  Note: if they are originally called ...TT-SS... instead of ...SS-TT..., you need to symlink them (or move, or copy).
- target tagging model:
    TT.tagger.udpipe
All of these can be obtained from https://bitbucket.org/hy-crossNLP/vardial2017

You also need to have:
- Bash
- Perl 5
- Python 3
- word2vec (https://code.google.com/archive/p/word2vec/); we used rev 41 from 15th Sep 2014
- udpipe (https://github.com/ufal/udpipe); we used commit 3e65d69 from 3rd Jan 2017
- Treex (https://github.com/ufal/treex); we used commit d27ee8a from 21st Dec 2016

The most basic setup is the sl-hr one (train_sl-hr.sh):
- normalization of deprels
- 1:1 word-alignment of parallel data with Monolingual Greedy Aligner
- simple word-by-word translation of the source treebank
- pre-training of target word embeddings
- simplification of morphological features (use only Case)
- and finally, training and evaluating the parser

Both da+sv-no (train_ds-no.sh) and cs-sk (train_cs-sk.sh) add some cross-tagging, which seems to be useful only in specific cases (see the paper for details). Moreover, cs-sk also adds more morphological features, selecting those that seem to be very often shared in parallel data.

The whole pipeline takes tens of hours to run and uses several GB of RAM, so make sure to use a powerful computer.
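Since the pipeline runs for tens of hours, it may be worth checking the input files before starting. The following is a minimal sketch in Python 3, not part of the published scripts: the function names (required_files, link_reversed_parallel_data, missing_files) are hypothetical, and it assumes a single source/target pair with the file naming described above. It symlinks parallel data that is named ...TT-SS... to the expected ...SS-TT... names and reports which required files are still missing.

```python
import os
from pathlib import Path


def required_files(src, tgt):
    """File names the description above asks for, for one SS/TT pair."""
    return [
        f"{src}-ud-train.conllu",
        f"{tgt}-ud-predPoS-dev.conllu",
        f"OpenSubtitles2016.{src}-{tgt}.{src}",
        f"OpenSubtitles2016.{src}-{tgt}.{tgt}",
        f"{tgt}.tagger.udpipe",
    ]


def link_reversed_parallel_data(directory, src, tgt):
    """Symlink ...TT-SS... parallel files to the expected ...SS-TT... names.

    Returns the names of the links that were created.
    """
    directory = Path(directory)
    created = []
    for lang in (src, tgt):
        expected = directory / f"OpenSubtitles2016.{src}-{tgt}.{lang}"
        reversed_name = f"OpenSubtitles2016.{tgt}-{src}.{lang}"
        # Only link when the expected name is absent and the reversed one exists.
        if not os.path.lexists(expected) and (directory / reversed_name).exists():
            # Relative link target, resolved inside the same directory.
            os.symlink(reversed_name, expected)
            created.append(expected.name)
    return created


def missing_files(directory, src, tgt):
    """Required files that are not present in the directory."""
    directory = Path(directory)
    return [n for n in required_files(src, tgt) if not (directory / n).exists()]
```

For the sl-hr setup, one would call link_reversed_parallel_data(".", "sl", "hr") and then missing_files(".", "sl", "hr"); an empty list means all inputs named in the description are in place. The da+sv-no setup has two source languages, so under this sketch it would need one call per source.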
dc.language.iso	ces
dc.language.iso	slk
dc.language.iso	slv
dc.language.iso	hrv
dc.language.iso	dan
dc.language.iso	swe
dc.language.iso	nor
dc.publisher	Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.relation	info:eu-repo/grantAgreement/EC/H2020/644402
dc.relation.isreferencedby	http://web.science.mq.edu.au/~smalmasi/vardial4/pdf/VarDial26.pdf
dc.rights	GNU General Public License 2 or later (GPL-2.0)
dc.rights.uri	http://opensource.org/licenses/GPL-2.0
dc.subject	parsing
dc.subject	dependency parser
dc.subject	universal dependencies
dc.subject	cross-lingual parsing
dc.title	Slavic Forest, Norwegian Wood (scripts)
dc.type	toolService
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent	true
metashare.ResourceInfo#ContentInfo.detailedType	suiteOfTools
dc.rights.label	PUB
has.files	yes
branding	LINDAT / CLARIAH-CZ
contact.person	Rudolf Rosa rosa@ufal.mff.cuni.cz Charles University, UFAL
sponsor	European Union EC/H2020/644402 HimL - Health in my Language euFunds info:eu-repo/grantAgreement/EC/H2020/644402
sponsor	Ministerstvo školství, mládeže a tělovýchovy České republiky LM2015071 LINDAT/CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat nationalFunds
sponsor	Grantová agentura Univerzity Karlovy v Praze GAUK 15723/2014 Modelování závislostní syntaxe napříč jazyky nationalFunds
sponsor	Univerzita Karlova (mimo GAUK) SVV 260 333 Specifický vysokoškolský výzkum nationalFunds
sponsor	Grantová agentura České republiky 15-10472S Morphologically and Syntactically Annotated Corpora of Many Languages nationalFunds
files.size	24254
files.count	11