Slavic Forest, Norwegian Wood (scripts)
Please use the following text to cite this item:
Rosa, Rudolf; Zeman, Daniel; Mareček, David and Žabokrtský, Zdeněk, 2017,
Slavic Forest, Norwegian Wood (scripts), LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL),
http://hdl.handle.net/11234/1-1970.
Authors: Rosa, Rudolf; Zeman, Daniel; Mareček, David; Žabokrtský, Zdeněk
Item identifier: http://hdl.handle.net/11234/1-1970
Date issued: 2017-01-28
Description
Tools and scripts used to create the cross-lingual parsing models submitted to the VarDial 2017 shared task (https://bitbucket.org/hy-crossNLP/vardial2017), as described in the linked paper. The trained UDPipe models themselves are published in a separate submission (https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1971).
For each source (SS, e.g. sl) and target (TT, e.g. hr) language,
you need to add the following into this directory:
- treebanks (Universal Dependencies v1.4):
    SS-ud-train.conllu
    TT-ud-predPoS-dev.conllu
- parallel data (OpenSubtitles from Opus):
    OpenSubtitles2016.SS-TT.SS
    OpenSubtitles2016.SS-TT.TT
    !!! If they are originally called ...TT-SS... instead of ...SS-TT...,
    you need to symlink them (or move, or copy) !!!
- target tagging model:
    TT.tagger.udpipe
All of these can be obtained from https://bitbucket.org/hy-crossNLP/vardial2017
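The file naming scheme above can be checked before starting a long run. The following is a minimal sketch and not part of the published scripts: the helper names (expected_files, link_swapped_parallel, missing_files) are hypothetical, and it assumes all inputs sit in a single directory.

```python
import os
from pathlib import Path


def expected_files(src, tgt):
    """List the input file names described above for one language pair."""
    return [
        f"{src}-ud-train.conllu",
        f"{tgt}-ud-predPoS-dev.conllu",
        f"OpenSubtitles2016.{src}-{tgt}.{src}",
        f"OpenSubtitles2016.{src}-{tgt}.{tgt}",
        f"{tgt}.tagger.udpipe",
    ]


def link_swapped_parallel(src, tgt, directory="."):
    """Symlink ...TT-SS... parallel files to the ...SS-TT... names if needed."""
    directory = Path(directory)
    created = []
    for lang in (src, tgt):
        wanted = directory / f"OpenSubtitles2016.{src}-{tgt}.{lang}"
        swapped = directory / f"OpenSubtitles2016.{tgt}-{src}.{lang}"
        if not wanted.exists() and swapped.exists():
            # Relative link, so the directory can be moved as a whole.
            os.symlink(swapped.name, wanted)
            created.append(wanted.name)
    return created


def missing_files(src, tgt, directory="."):
    """Return the expected input files that are not present."""
    directory = Path(directory)
    return [name for name in expected_files(src, tgt)
            if not (directory / name).exists()]


if __name__ == "__main__":
    link_swapped_parallel("sl", "hr")
    for name in missing_files("sl", "hr"):
        print("missing:", name)
```

Run it from the directory that holds the scripts; an empty output means all inputs for sl-hr are in place.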
You also need to have:
- Bash
- Perl 5
- Python 3
- word2vec (https://code.google.com/archive/p/word2vec/); we used rev 41 from 15th Sep 2014
- udpipe (https://github.com/ufal/udpipe); we used commit 3e65d69 from 3rd Jan 2017
- Treex (https://github.com/ufal/treex); we used commit d27ee8a from 21st Dec 2016
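A quick way to confirm that the tools listed above are installed is to look them up on PATH. This is a minimal sketch; the executable names (in particular word2vec, udpipe and treex) are assumptions and may differ depending on how each tool was built or installed. It does not check versions or commits.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = ["bash", "perl", "python3", "word2vec", "udpipe", "treex"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    for tool in find_missing_tools():
        print("not found on PATH:", tool)
```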
The most basic setup is the sl-hr one (train_sl-hr.sh):
- normalization of deprels
- 1:1 word-alignment of parallel data with Monolingual Greedy Aligner
- simple word-by-word translation of source treebank
- pre-training of target word embeddings
- simplification of morpho feats (use only Case)
- and finally, training and evaluating the parser
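To illustrate the word-by-word translation step, here is a minimal sketch under stated assumptions; it is not the code of trtable_src2tgt_feats.py or translate_conll_src2tgt_feats.py. It assumes the 1:1 alignment is available as (source word, target word) pairs, picks the most frequent target word for each source word, and replaces only the FORM column of each CoNLL-U token line so that the tree structure stays untouched. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_translation_table(aligned_pairs):
    """Map each source word to its most frequent aligned target word."""
    counts = defaultdict(Counter)
    for src_word, tgt_word in aligned_pairs:
        counts[src_word][tgt_word] += 1
    return {src: c.most_common(1)[0][0] for src, c in counts.items()}


def translate_conllu_line(line, table):
    """Replace the FORM column of one CoNLL-U token line."""
    if not line.strip() or line.startswith("#"):
        return line  # sentence boundaries and comments pass through
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        return line  # not a regular token line; leave it alone
    cols[1] = table.get(cols[1], cols[1])  # unknown words stay untranslated
    return "\t".join(cols) + ("\n" if line.endswith("\n") else "")


if __name__ == "__main__":
    pairs = [("hiša", "kuća"), ("hiša", "kuća"), ("hiša", "dom")]
    table = build_translation_table(pairs)
    token = "1\thiša\thiša\tNOUN\t_\tCase=Nom\t0\troot\t_\t_\n"
    print(translate_conllu_line(token, table), end="")
```

Because every source token maps to exactly one target token, the HEAD and DEPREL columns remain valid after translation.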
Both da+sv-no (train_ds-no.sh) and cs-sk (train_cs-sk.sh) add some cross-tagging, which seems to be useful only in
specific cases (see the paper for details).
In addition, cs-sk uses more morpho features, selecting those that
seem to be shared very often in the parallel data.
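The two feature setups (Case only, or a set of cross-lingually shared features) can both be seen as pruning the CoNLL-U FEATS column to a chosen set of feature names. The sketch below is an illustration under stated assumptions, not the code of feats2FEAT.py, get_shared_features.pl or prune_features.pl: the function names are hypothetical, the input is assumed to be pairs of FEATS strings for aligned tokens, and the 0.9 agreement threshold is an arbitrary value chosen for the example.

```python
from collections import Counter


def parse_feats(feats):
    """Parse a FEATS string such as 'Case=Nom|Number=Sing' into a dict."""
    if feats in ("_", ""):
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


def prune_feats(feats, keep):
    """Keep only the feature names in `keep`; return '_' if nothing is left."""
    kept = {k: v for k, v in parse_feats(feats).items() if k in keep}
    ordered = sorted(kept.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{k}={v}" for k, v in ordered) or "_"


def shared_feature_names(aligned_feats, threshold=0.9):
    """Select feature names whose values agree often enough in aligned tokens.

    `aligned_feats` yields (source FEATS, target FEATS) string pairs.
    The threshold is an illustrative assumption, not a value from the paper.
    """
    seen, agree = Counter(), Counter()
    for src, tgt in aligned_feats:
        src_f, tgt_f = parse_feats(src), parse_feats(tgt)
        for name in set(src_f) | set(tgt_f):
            seen[name] += 1
            if src_f.get(name) == tgt_f.get(name):
                agree[name] += 1
    return {name for name in seen if agree[name] / seen[name] >= threshold}


if __name__ == "__main__":
    # The sl-hr style setup: keep only Case.
    print(prune_feats("Case=Nom|Gender=Fem|Number=Sing", {"Case"}))
    # The cs-sk style setup: keep whatever is shared in the aligned data.
    pairs = [("Case=Nom|Gender=Fem", "Case=Nom|Gender=Masc"),
             ("Case=Acc|Gender=Neut", "Case=Acc")]
    print(sorted(shared_feature_names(pairs)))
```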
The whole pipeline takes tens of hours to run, and uses several GB of RAM, so make sure to use a powerful computer.
Acknowledgement
European Union
Project code: EC/H2020/644402
Project name: HimL - Health in my Language
Ministerstvo školství, mládeže a tělovýchovy České republiky (Ministry of Education, Youth and Sports of the Czech Republic)
Project code: LM2015071
Project name: LINDAT/CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat (Institute for Analysis, Processing and Distribution of Linguistic Data)
Grantová agentura Univerzity Karlovy v Praze (Grant Agency of Charles University in Prague)
Project code: GAUK 15723/2014
Project name: Modelování závislostní syntaxe napříč jazyky (Modelling dependency syntax across languages)
Univerzita Karlova (Charles University), outside GAUK
Project code: SVV 260 333
Project name: Specifický vysokoškolský výzkum (Specific university research)
Grantová agentura České republiky (Czech Science Foundation)
Project code: 15-10472S
Project name: Morphologically and Syntactically Annotated Corpora of Many Languages
Files in this item
All files are listed with format application/octet-stream.

Name                              Size     Description                            MD5
feats2FEAT.py                     412 B    Features simplification                5089de1e63c1aa36cf284bb85600365c
trtable_src2tgt_feats.py          2.12 KB  Translation table creation             43e880128e2fc6c66bdcd3d5835a1d69
normalize.pl                      11.1 KB  Deprel normalization                   9211df21bda377f6d62681a48d7614cc
translate_conll_src2tgt_feats.py  1.16 KB  Treebank translation                   792e075d41a9c1889cd0470bbab0c842
get_shared_features.pl            973 B    Find crosslingually shared features    dde865ce7b96efabfbabce34d573b3d4
prune_features.pl                 1.09 KB  Keep crosslingually shared features    9674593e9bd64947bb5ee3a1fc5c5d95
feats2FEAT2xpos.py                412 B    Features simplification and moving     9b3f338bf5dc7b822d50e4deaf93f395
train_cs-sk.sh                    1.89 KB  The full training script for cs-sk     6810a887f8bdfaf96df06d279452ce7d
train_ds-no.sh                    2.12 KB  The full training script for da+sv-no  41d4a5deb15b04d06827a0ee9953de18
monogreedy_align.sh               895 B    Word alignment                         415cca16e21a9587ff4d596d1251906c
train_sl-hr.sh                    1.57 KB  The full training script for sl-hr     948900d9e5c936d9ab497675d053beb6
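The MD5 values in the file list above can be used to check that a download is intact. A minimal sketch using Python's standard hashlib module follows; the function names are hypothetical, and only three of the checksums are copied into the example, so extend the dictionary as needed.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Checksums copied from the file list above (three shown as an example).
EXPECTED_MD5 = {
    "normalize.pl": "9211df21bda377f6d62681a48d7614cc",
    "monogreedy_align.sh": "415cca16e21a9587ff4d596d1251906c",
    "train_sl-hr.sh": "948900d9e5c936d9ab497675d053beb6",
}


def verify(path, expected_md5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected_md5.lower()
```

For example, verify("normalize.pl", EXPECTED_MD5["normalize.pl"]) should return True for an unmodified download of that script.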

