Czech image captioning, machine translation, sentiment analysis and summarization (Neural Monkey models)

Libovický, Jindřich; Rosa, Rudolf; Helcl, Jindřich; Popel, Martin

dc.contributor.author	Libovický, Jindřich
dc.contributor.author	Rosa, Rudolf
dc.contributor.author	Helcl, Jindřich
dc.contributor.author	Popel, Martin
dc.date.accessioned	2020-01-10T09:43:29Z
dc.date.available	2020-01-10T09:43:29Z
dc.date.issued	2020-01-07
dc.identifier.uri	http://hdl.handle.net/11234/1-3145
dc.description	This submission contains trained end-to-end models for the Neural Monkey toolkit for Czech and English, solving four NLP tasks: machine translation, image captioning, sentiment analysis, and summarization. The models are trained on standard datasets and achieve state-of-the-art or near state-of-the-art performance on these tasks. The models are described in the accompanying paper. The same models can also be invoked via the online demo: https://ufal.mff.cuni.cz/grants/lsd

In addition to the models presented in the referenced paper (developed and published in 2018), we include models for automatic news summarization for Czech and English developed in 2019. The Czech models were trained on the SumeCzech dataset (https://www.aclweb.org/anthology/L18-1551.pdf) and the English models on the CNN-Daily Mail corpus (https://arxiv.org/pdf/1704.04368.pdf), using the standard recurrent sequence-to-sequence architecture.

There are several separate ZIP archives here, each containing one model solving one of the tasks for one language.

To use a model, you first need to install Neural Monkey: https://github.com/ufal/neuralmonkey To ensure correct functioning of the model, please use the exact version of Neural Monkey specified by the commit hash stored in the 'git_commit' file in the model directory.

Each model directory contains a 'run.ini' Neural Monkey configuration file, to be used to run the model. See the Neural Monkey documentation to learn how to do that (you may need to update some paths to correspond to your filesystem organization). The 'experiment.ini' file, which was used to train the model, is also included. The remaining files contain the model itself, the input and output vocabularies, etc.

For the sentiment analyzers, you should tokenize your input data using the Moses tokenizer: https://pypi.org/project/mosestokenizer/ For machine translation, you do not need to tokenize the data, as this is done by the model.
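The version-pinning step described above could be scripted roughly as follows. This is only a sketch: it assumes the 'git_commit' file holds a bare hexadecimal commit hash (consistent with its 41-byte size in the file listings) and that Neural Monkey has already been cloned locally. The function names and the default `repo_dir` value are hypothetical and not part of Neural Monkey.

```python
import re
from pathlib import Path


def read_pinned_commit(model_dir):
    """Return the commit hash stored in the model's 'git_commit' file.

    Assumption: the file contains a single hexadecimal hash and nothing else.
    """
    commit = Path(model_dir, "git_commit").read_text(encoding="utf-8").strip()
    if not re.fullmatch(r"[0-9a-f]{7,40}", commit):
        raise ValueError(f"unexpected content in git_commit: {commit!r}")
    return commit


def checkout_command(model_dir, repo_dir="neuralmonkey"):
    """Build (but do not run) the git command that pins the local clone.

    'repo_dir' is a hypothetical path to a local Neural Monkey clone.
    """
    return ["git", "-C", repo_dir, "checkout", read_pinned_commit(model_dir)]
```

The returned list can be passed to `subprocess.run(..., check=True)` once the paths have been adapted to the local filesystem.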
For image captioning, you need to:
- download a trained ResNet: http://download.tensorflow.org/models/resnet_v2_50_2017_04_14.tar.gz
- clone the git repository with TensorFlow models: https://github.com/tensorflow/models
- preprocess the input images with the Neural Monkey 'scripts/imagenet_features.py' script (https://github.com/ufal/neuralmonkey/blob/master/scripts/imagenet_features.py); you need to specify the paths to the ResNet and to the TensorFlow models for this script

The summarization models require input that is tokenized with the Moses tokenizer (https://github.com/alvations/sacremoses) and lower-cased.

Feel free to contact the authors of this submission in case you run into problems!
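The input preparation for the summarization models (Moses tokenization followed by lower-casing) might look like the sketch below. It assumes the third-party `sacremoses` package is installed. Whether the models expect escaped special characters (the `sacremoses` default) is not stated in this record, so that setting is an assumption to check against the training configuration. The helper names are hypothetical.

```python
def make_moses_tokenizer(lang):
    """Return a function that tokenizes one line with sacremoses.

    Imported lazily so the rest of the sketch works without the package.
    Assumption: default sacremoses settings (including escaping) are suitable.
    """
    from sacremoses import MosesTokenizer  # third-party: pip install sacremoses

    tokenizer = MosesTokenizer(lang=lang)
    return lambda line: tokenizer.tokenize(line, return_str=True)


def prepare_for_summarization(lines, tokenize):
    """Tokenize each line with the given function, then lower-case it."""
    return [tokenize(line.strip()).lower() for line in lines]
```

For example, `prepare_for_summarization(articles, make_moses_tokenizer("cs"))` would prepare Czech input, with `"en"` used for the English model. Passing the tokenizer as a parameter keeps the lower-casing step independent of the tokenizer package.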
dc.language.iso	ces
dc.language.iso	eng
dc.publisher	Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
dc.relation.isreferencedby	http://ceur-ws.org/Vol-2203/138.pdf
dc.relation.replaces	http://hdl.handle.net/11234/1-2839
dc.rights	Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.rights.uri	http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.source.uri	https://ufal.mff.cuni.cz/grants/lsd
dc.subject	sentiment analysis
dc.subject	machine translation
dc.subject	image captioning
dc.subject	neural networks
dc.subject	transformer
dc.subject	Neural Monkey
dc.subject	summarization
dc.title	Czech image captioning, machine translation, sentiment analysis and summarization (Neural Monkey models)
dc.type	toolService
metashare.ResourceInfo#ResourceComponentType#ToolServiceInfo.languageDependent	true
metashare.ResourceInfo#ContentInfo.detailedType	suiteOfTools
dc.rights.label	PUB
has.files	yes
branding	LINDAT / CLARIAH-CZ
demo.uri	https://ufal.mff.cuni.cz/grants/lsd
contact.person	Jindřich Libovický libovicky@ufal.mff.cuni.cz Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
contact.person	Rudolf Rosa rosa@ufal.mff.cuni.cz Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
sponsor	GA ČR 18-02196S Reprezentace lingvistické struktury v neuronových sítích nationalFunds
sponsor	Ministerstvo školství, mládeže a tělovýchovy České republiky LM2015071 LINDAT/CLARIN: Institut pro analýzu, zpracování a distribuci lingvistických dat nationalFunds
sponsor	Ministerstvo školství, mládeže a tělovýchovy České republiky CZ.02.1.01/0.0/0.0/16_013/0001781 LINDAT/CLARIN - Výzkumná infrastruktura pro jazykové technologie - rozšíření repozitáře a výpočetní kapacity nationalFunds
sponsor	Univerzita Karlova (mimo GAUK) SVV 260 453 Specifický vysokoškolský výzkum nationalFunds
sponsor	GAUK 976518 Využití lingvistické informace v neuronovém strojovém překladu ownFunds
files.size	4328681659
files.count	8

Files in this item

This item is publicly available and licensed under: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

Name: sentiment_en_yelp_rnn_san.zip
Size: 119.3 MB
Format: application/zip
Description: English sentiment analysis (Yelp)
MD5: 13c60ad183ef745c0cb516f87beaacf4

sentiment_en_yelp_rnn_san
- variables.data.index (1 kB)
- experiment.log (1 MB)
- run.ini (1 kB)
- original.ini (2 kB)
- variables.data.meta (775 kB)
- classes.txt (10 B)
- git_diff (0 B)
- variables.data.best (14 B)
- args (68 B)
- experiment.ini (1 kB)
- variables.data.data-00000-of-00001 (128 MB)
- git_commit (41 B)
- checkpoint (181 B)
- vocabulary30k.txt (363 kB)

Name: sentiment_cs_csfd_rnn_san.zip
Size: 184.5 MB
Format: application/zip
Description: Czech sentiment analysis (ČSFD)
MD5: c94f110b9330f3e30c5fb1f4b49e81d6

sentiment_cs_csfd_rnn_san
- variables.data.index (1 kB)
- experiment.log (113 kB)
- run.ini (1 kB)
- original.ini (2 kB)
- variables.data.meta (654 kB)
- classes.txt (26 B)
- git_diff (0 B)
- variables.data.best (14 B)
- args (68 B)
- variables.data.data-00000-of-00001 (197 MB)
- experiment.ini (1 kB)
- git_commit (41 B)
- vocabulary50k.txt (588 kB)
- checkpoint (181 B)

Name: translation_encs_transformer.zip
Size: 1.76 GB
Format: application/zip
Description: English-to-Czech machine translation
MD5: 3057c5e1a17ca03e533a20b525140dc7

translation_encs_transformer
- checkpoint (85 B)
- variables.data.meta (1 GB)
- variables.data.best (15 B)
- experiment.ini (1 kB)
- vocab (296 kB)
- variables.data.data-00000-of-00001 (800 MB)
- variables.data.index (10 kB)
- preprocess.ini (281 B)

Name: captioning_cs_bigger.zip
Size: 366.38 MB
Format: application/zip
Description: Czech image captioning
MD5: c76a42de20d9ba675f25f5d8f2f82627

captioning_cs_bigger
- vocab.cs (60 kB)
- experiment.log (72 kB)
- run.ini (1 kB)
- original.ini (2 kB)
- variables.data.avg-0.data-00000-of-00001 (395 MB)
- variables.data.avg-0.meta (268 kB)
- git_diff (0 B)
- args (53 B)
- variables.data.best (21 B)
- experiment.ini (2 kB)
- git_commit (41 B)
- checkpoint (97 B)
- variables.data.avg-0.index (2 kB)

Name: captioning_en_multiref_bigger.zip
Size: 399.45 MB
Format: application/zip
Description: English image captioning
MD5: 1baa582d527ea5f758ac5f8748c172c4

captioning_en_multiref_bigger
- experiment.log (3 MB)
- run.ini (1 kB)
- original.ini (2 kB)
- variables.data.avg-0.data-00000-of-00001 (431 MB)
- variables.data.avg-0.meta (268 kB)
- git_diff (0 B)
- variables.data.best (21 B)
- args (53 B)
- experiment.ini (2 kB)
- git_commit (41 B)
- checkpoint (97 B)
- variables.data.avg-0.index (2 kB)
- en.vocab (77 kB)

Name: resnet.zip
Size: 83.65 MB
Format: application/zip
Description: ResNet
MD5: 8d67aaecdf30b75d08bd6babf70d5237

resnet
- variables.data.index (10 kB)
- experiment.log (19 kB)
- run.ini (621 B)
- original.ini (1 kB)
- variables.data.meta (1 MB)
- git_diff (2 kB)
- variables.data.best (15 B)
- args (83 B)
- variables.data.data-00000-of-00001 (89 MB)
- experiment.ini (826 B)
- git_commit (41 B)
- checkpoint (85 B)

Name: cnn-daily-mail-rnn-rnn.zip
Size: 586.55 MB
Format: application/zip
MD5: 59ce9b2cd3d7b7f0e1e3cfbd5706a446

cnn-daily-mail-rnn-rnn
- variables.data.index (4 kB)
- experiment.log (76 MB)
- original.ini (2 kB)
- variables.data.meta (1 MB)
- git_diff (977 B)
- variables.data.best (14 B)
- args (59 B)
- variables.data.data-00000-of-00001 (813 MB)
- experiment.ini (2 kB)
- git_commit (41 B)
- checkpoint (211 B)

Name: sumeczech-rnn-rnn.zip
Size: 588.76 MB
Format: application/zip
MD5: b6ccd233449a6509004c742b4f923eea

sumeczech-rnn-rnn
- variables.data.index (4 kB)
- experiment.log (63 MB)
- original.ini (2 kB)
- variables.data.meta (1 MB)
- git_diff (977 B)
- variables.data.best (14 B)
- args (54 B)
- variables.data.data-00000-of-00001 (813 MB)
- experiment.ini (1 kB)
- git_commit (41 B)
- checkpoint (201 B)
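
Each archive in this item is listed with an MD5 checksum, so a download can be checked before unpacking. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the checksum table simply copies the values shown in this record.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's MD5 matches the expected value."""
    return md5_of_file(path) == expected_md5.strip().lower()


# MD5 values copied from the file listing in this record.
EXPECTED_MD5 = {
    "sentiment_en_yelp_rnn_san.zip": "13c60ad183ef745c0cb516f87beaacf4",
    "sentiment_cs_csfd_rnn_san.zip": "c94f110b9330f3e30c5fb1f4b49e81d6",
    "translation_encs_transformer.zip": "3057c5e1a17ca03e533a20b525140dc7",
    "captioning_cs_bigger.zip": "c76a42de20d9ba675f25f5d8f2f82627",
    "captioning_en_multiref_bigger.zip": "1baa582d527ea5f758ac5f8748c172c4",
    "resnet.zip": "8d67aaecdf30b75d08bd6babf70d5237",
    "cnn-daily-mail-rnn-rnn.zip": "59ce9b2cd3d7b7f0e1e3cfbd5706a446",
    "sumeczech-rnn-rnn.zip": "b6ccd233449a6509004c742b4f923eea",
}
```

For instance, `verify_download("resnet.zip", EXPECTED_MD5["resnet.zip"])` should return `True` for an intact download. Reading in chunks avoids loading the larger archives (up to 1.76 GB) into memory at once.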
