Czech Grammar Agreement Dataset for Evaluation of Language Models

Name: Czech Grammar Agreement Dataset for Evaluation of Language Models
License: http://creativecommons.org/licenses/by-sa/4.0/

Baisa, Vít

dc.contributor.author	Baisa, Vít
dc.date.accessioned	2017-01-10T19:42:18Z
dc.date.available	2017-01-10T19:42:18Z
dc.date.issued	2016-12-02
dc.identifier.uri	http://hdl.handle.net/11234/1-1933
dc.description	AGREE is a dataset and task for evaluating language models on grammatical agreement in Czech. The dataset consists of sentences in which the suffixes of past-tense verbs are marked. The task is to choose the correct verb suffix, which depends on the gender, number and animacy of the subject. It is challenging for language models because 1) Czech is morphologically rich, 2) it has relatively free word order, 3) it has a high out-of-vocabulary (OOV) ratio, 4) the predicate and subject can be far apart, 5) subjects can be unexpressed and 6) various semantic rules may apply. The task provides a straightforward and easily reproducible way of evaluating language models on a morphologically rich language.
dc.language.iso	ces
dc.publisher	Masaryk University, NLP Centre
dc.rights	Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
dc.rights.uri	http://creativecommons.org/licenses/by-sa/4.0/
dc.source.uri	https://www.muni.cz/vyzkum/publikace/1362555
dc.subject	agreement
dc.subject	past tense verb suffix
dc.subject	language model
dc.subject	training data
dc.title	Czech Grammar Agreement Dataset for Evaluation of Language Models
dc.type	corpus
metashare.ResourceInfo#ContentInfo.mediaType	text
dc.rights.label	PUB
has.files	yes
branding	LINDAT / CLARIAH-CZ
demo.uri	https://nlp.fi.muni.cz/~xbaisa/agree/
contact.person	Vít Baisa xbaisa@fi.muni.cz Masaryk University, NLP Centre
size.info	10000000 sentences
files.size	391120182
files.count	1
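
The task in dc.description (pick the verb suffix that a language model prefers, then compare it with the marked one) could be scored roughly as in the sketch below. This is a minimal illustration, not the dataset's official evaluation script. The slot marker `*`, the suffix inventory in `SUFFIXES`, the `(template, gold_suffix)` example format and the `score` callable are all assumptions made for this example; the record does not specify the file format, so check the dataset itself before relying on any of them.

```python
from typing import Callable, Iterable, Tuple

# Assumed inventory of Czech past-tense endings that follow the "-l" stem:
# "" (masc. sg.), "a" (fem. sg. / neut. pl.), "o" (neut. sg.),
# "i" (masc. animate pl.), "y" (masc. inanimate / fem. pl.).
SUFFIXES = ("", "a", "o", "i", "y")

# Hypothetical marker for the position of the suffix in a sentence.
SLOT = "*"


def choose_suffix(template: str, score: Callable[[str], float]) -> str:
    """Return the suffix whose filled-in sentence gets the highest score.

    Assumes the template contains a single SLOT marker. Ties are resolved
    in favour of the suffix listed first in SUFFIXES.
    """
    return max(SUFFIXES, key=lambda s: score(template.replace(SLOT, s)))


def accuracy(
    examples: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of examples where the chosen suffix equals the gold suffix."""
    examples = list(examples)
    if not examples:
        return 0.0
    correct = sum(choose_suffix(t, score) == gold for t, gold in examples)
    return correct / len(examples)


# Toy stand-in for a language model: known-good sentences score 1.0,
# everything else 0.0. A real model would return e.g. a log-probability.
KNOWN_GOOD = {"Žena dělala večeři.", "Muž dělal večeři."}


def toy_score(sentence: str) -> float:
    return 1.0 if sentence in KNOWN_GOOD else 0.0


examples = [
    ("Žena dělal* večeři.", "a"),  # feminine singular subject
    ("Muž dělal* večeři.", ""),    # masculine singular subject
]
toy_accuracy = accuracy(examples, toy_score)
```

Any model that can assign a score to a whole sentence can be plugged in as `score`, which keeps the evaluation independent of the model architecture.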