Search Results
272. Question Dialogs Dataset
- Creator:
- Vodolán, Miroslav and Jurčíček, Filip
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, other, and lexicalConceptualResource
- Subject:
- question dialogs and interactive learning
- Language:
- English
- Description:
- Dataset collected from natural dialogs that makes it possible to test the ability of dialog systems to interactively learn new facts from user utterances throughout the dialog. The dataset, consisting of 1900 dialogs, allows simulation of the interactive acquisition of denotations and question explanations from users, which can then be used for interactive learning.
- Rights:
- Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0), http://creativecommons.org/licenses/by-sa/4.0/, and PUB
273. RobeCzech Base
- Creator:
- Straka, Milan, Náplava, Jakub, Straková, Jana, and Samuel, David
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, mlmodel, and languageDescription
- Subject:
- Czech, BERT, and RoBERTa
- Language:
- Czech
- Description:
- RobeCzech is a monolingual RoBERTa language representation model trained on Czech data. RoBERTa is a robustly optimized Transformer-based pretraining approach. We show that RobeCzech considerably outperforms equally-sized multilingual and Czech-trained contextualized language representation models, surpasses current state of the art in all five evaluated NLP tasks and reaches state-of-the-art results in four of them. The RobeCzech model is released publicly at https://hdl.handle.net/11234/1-3691 and https://huggingface.co/ufal/robeczech-base, both for PyTorch and TensorFlow.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
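The Hugging Face identifier given in the description above can be tried out with a masked-word prediction. The following is a minimal sketch, assuming the `transformers` package (with a PyTorch or TensorFlow backend) is installed and the model can be downloaded; the helper names and the `{mask}` placeholder convention are illustrative and not part of the release.

```python
from typing import List, Tuple

# Model identifier taken from the record description above.
MODEL_ID = "ufal/robeczech-base"


def build_masked_sentence(template: str, mask_token: str) -> str:
    """Replace the hypothetical {mask} placeholder with the tokenizer's mask token."""
    return template.format(mask=mask_token)


def top_fill_mask_predictions(template: str, top_k: int = 5) -> List[Tuple[str, float]]:
    """Return (token, score) pairs for the masked position in `template`.

    The import is deferred so that this module can be loaded without
    transformers installed; calling the function downloads the model.
    """
    from transformers import pipeline

    fill = pipeline("fill-mask", model=MODEL_ID)
    # Ask the tokenizer for its mask token instead of hard-coding it.
    sentence = build_masked_sentence(template, fill.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill(sentence, top_k=top_k)]
```

Calling `top_fill_mask_predictions("Praha je {mask} město.")` would return the model's top candidates for the masked word; the actual predictions depend on the downloaded weights.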
274. ROMi 1.0
- Creator:
- Šebesta, Karel, Bedřichová, Zuzanna, Šormová, Kateřina, Straňák, Pavel, and Peterek, Nino
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- audio and corpus
- Subject:
- ethnolect, spoken corpora, and Czech of Romany pupils
- Language:
- Czech
- Description:
- ROMi is a specific subcorpus of CZESL (Czech as a Second Language). It collects examples of language use, both spoken and written, of Czech Romani children and teenagers. The range of materials exceeds 1.5 million words.
Language Material
The material presents uses of spoken language by a specific group of Romani speakers who use Czech as their first language. However, this form of the language differs markedly from Czech as used by the Czech-speaking majority, both on the spoken and, secondarily, on the written level. It is the so-called Romani ethnolect of Czech, i.e. a variety of Czech used by Romani communities mainly in the Czech Republic, with an obvious influence of Romani, Slovak and Hungarian. Furthermore, many of the recorded speakers live in social exclusion, and thus their language production is influenced by both factors, i.e. by the Romani ethnolect and by social exclusion.
The language material was collected in the years 2009–2012 under the Education for Competitiveness Operational Programme, within the framework of the project Innovations of Czech as a Second Language Education, collaboratively by the Technical University of Liberec and the Institute of Czech Language and Theory of Communication, Faculty of Arts, Charles University. The language material was processed with the support of the Institute of Formal and Applied Linguistics (project LINDAT-Clarin).
The material comprises 110 recordings obtained in various environments: the collection took place both in schools and in several non-profit organizations offering leisure time activities to Romani students. Apart from the school setting, the recordings thus come from the environment of extracurricular activities, sport matches and households. Both the respondents and the collectors are Romani. The samples were acquired in all regions of the Czech Republic, although the majority of recordings were obtained in the Central Bohemia, South Bohemia, Ústí and Vysočina Regions. The age of the respondents ranges from 12 to 28 years.
The collected samples are accompanied by metadata relating to the following areas:
• The place of origin: the place of collection, the size of the residence and dialect area, region, environment (school, extracurricular, private), socially excluded locality.
• The circumstances of the collection, expressing the extent of control exercised by the collector (topic assigned/non-assigned).
• The respondent: the age of the student; class/year; sex; type of the school; subjective knowledge of Romani; first language (the one the student considers to be his first); communicative environment in the family (which language(s) is/are used for communication in the family).
• The place of data collection: in the case of schools, the metadata comprise the type of school (primary, for students with special needs, remedial, vocational, secondary) and the founder (state, church, private organisation); in the case of individual data collection, they include the organisation, interest group markings, etc.
• The collector: the abbreviation of the collector's name and his work area, in some cases also his age.
Delimiting the group of respondents
The respondents are students of primary schools, schools for students with special needs and secondary schools, and teenagers who have just completed compulsory education. For the purposes of the language material collection, those students who consider themselves to be Romani or who are considered Romani by others were included in the sample. Moreover, a language criterion was added to this definition: students in whose families Romani is spoken at home were also included. Active knowledge of the Romani language was not required, since hardly a third of Romani children living in the Czech Republic nowadays are competent in this language.
Ethical aspects of the data collection and processing
The content of the language material places ethical demands on the data processing. The texts and recordings frequently feature highly interesting material; the respondents talk about life stories that are remote from or inconceivable to the social majority. During the transcription process, all materials are anonymized and identification data are removed.
Field Research
When dealing with an environment threatened by social exclusion, it is highly important to consider the needs and opportunities of the group members as well as the needs of those individuals who find themselves or work in such an environment. During the development of the corpus, we became convinced that it is necessary to accommodate different demands on the technical quality of texts and recordings and not to overburden the respondents and the collectors with limiting or impossible requirements. Therefore, the corpus comprises several recordings of lower technical quality which were acquired in the presence of other persons, with the television turned on, etc. Under stricter conditions, some of these recordings would not have come into existence at all: it is natural that the interviewing of younger children took place directly in their households, in the presence of their parents. Others would have been made, yet they would have been influenced by the unnaturalness of the situation, which would in turn have affected the language material. Apart from the interviews with younger children, this applies especially to conversations between the collectors and their peers, e.g. inside leisure time clubs.
Characteristics of the recordings
The collected recordings come both from the school environment (especially conversations of teacher assistants with individual students) and from leisure time facilities (interest groups, after-school tutoring). In most cases they are conversations between the collector and an individual respondent, or alternatively a pair of respondents. The length of the recordings differs, although the majority range from 20 to 35 minutes. A single recording contains approximately 2 495 words. The quality of the recordings is influenced by the limits of technologies usable in the field and by the effort to maximize authenticity.
Transcription of the recordings
The rules for transcription of the recordings are based on similar ones designed for the SCHOLA corpus. Transcriptions are carried out by means of folkloristic transcription, i.e. the form closest to the written record, specially adapted for the purposes of computational processing, following the practice established in the Czech National Corpus. The transcription is performed with the help of the Transcriber programme, which connects the sound and graphic track.
- Rights:
- Not specified
275. Self-paced reading experiments on explicit and implicit contrastive and temporal discourse relations in Czech
- Creator:
- Zikánová, Šárka and Smolík, Filip
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, other, and languageDescription
- Subject:
- discourse, psycholinguistic experiments, explicit discourse relations, implicit discourse relations, and self-paced reading
- Language:
- Czech
- Description:
- Supplementary materials for the paper “Processing of explicit and implicit contrastive and temporal discourse relations in Czech” (submitted to Discourse Processes)
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
276. Semantic annotation of noun/verb conversion in Czech
- Creator:
- Ševčíková, Magda, Kyjánek, Lukáš, Hledíková, Hana, and Staňková, Anna
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- other, text, and lexicalConceptualResource
- Subject:
- conversion, semantic, noun, verb, word formation, and Czech
- Language:
- Czech
- Description:
- The item contains a list of 2,058 noun/verb conversion pairs along with related formations (word-formation paradigms) provided with linguistic features, including semantic categories that characterize semantic relations between the noun and the verb in each conversion pair. Semantic categories were assigned manually by two human annotators based on a set of sentences containing the noun and the verb from individual conversion pairs. In addition to the list of paradigms, the item contains a set of 739 files (a separate file for each conversion pair) annotated by the annotators in parallel and a set of 2,058 files containing the final annotation, which is included in the list of paradigms.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), PUB, and http://creativecommons.org/licenses/by-nc-sa/4.0/
277. Semantically annotated sample of Czech and English conversion pairs of verbs and nouns
- Creator:
- Hledíková, Hana
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, wordList, and lexicalConceptualResource
- Subject:
- word-formation, morphology, conversion, semantics, and cognitive
- Language:
- English and Czech
- Description:
- Supplementary files for a comparative study of word-formation without the addition of derivational affixes (conversion) in English and Czech. The two .csv files contain 300 verb-noun conversion pairs in English and 300 verb-noun conversion pairs in Czech, i.e. pairs where either the noun is created from the verb or the verb is created from the noun without the use of derivational affixes. In English, the noun and verb in the conversion pair have the same form. In Czech, the noun and verb in the conversion pair differ in inflectional affixes. The pairs are supplied with manual semantic annotation based on cognitive event schemata. A file with the Appendix includes a list of dictionary definition phrases used as a basis for the semantic annotation.
- Rights:
- Creative Commons - Attribution 4.0 International (CC BY 4.0), http://creativecommons.org/licenses/by/4.0/, and PUB
278. Sentiment Analysis (Czech Model)
- Creator:
- Vysušilová, Petra and Straka, Milan
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text, mlmodel, and languageDescription
- Subject:
- sentiment analysis and BERT
- Language:
- Czech
- Description:
- Sentiment analysis models for the Czech language. The models are trained on three Czech sentiment analysis datasets (http://liks.fav.zcu.cz/sentiment/): Mall, CSFD, and Facebook, and on joint data from all three datasets, using RobeCzech, a Czech version of the BERT model. We present the best model for every dataset. The Mall and CSFD models are the new state of the art for the respective data. A demo Jupyter notebook is available on the project GitHub. These models are a part of the master thesis Czech NLP with Contextualized Embeddings.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
279. SiR 1.0
- Creator:
- Hladká, Barbora, Mírovský, Jiří, Kopp, Matyáš, and Moravec, Václav
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- text and corpus
- Subject:
- news server articles, attribution, attribution signals, attribution sources, and annotation
- Language:
- Czech
- Description:
- SiR 1.0 is a corpus of Czech articles published on iRozhlas, a news server of a Czech public radio (https://www.irozhlas.cz/). It is a collection of 1 718 articles (42 890 sentences, 614 995 words) with manually annotated attribution of citation phrases and sources. The sources are classified into several classes of named and unnamed sources. The corpus consists of three parts, depending on the quality of the annotations: (i) triple-annotated articles: 46 articles (933 sentences, 13 242 words) annotated independently by three annotators and subsequently curated by an arbiter, (ii) double-annotated articles: 543 articles (12 347 sentences, 180 622 words) annotated independently by two annotators and automatically unified, and (iii) single-annotated articles: 1 129 articles (29 610 sentences, 421 131 words) annotated each only by a single annotator. The data were annotated in the Brat tool (https://brat.nlplab.org/) and are distributed in the Brat native format, i.e. each article is represented by the original plain text and a stand-off annotation file. Please cite the following paper when using the corpus for your research: Hladká Barbora, Jiří Mírovský, Matyáš Kopp, Václav Moravec. Annotating Attribution in Czech News Server Articles. In: Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), pages 1817–1823, Marseille, France 20-25 June 2022.
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
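Since the corpus is distributed in the Brat native format (a plain-text file plus a stand-off annotation file per article), the text-bound annotations can be read with a few lines of code. The following is a minimal sketch of parsing the `T` lines of a Brat `.ann` file; the labels `SOURCE` and `SIGNAL` and the sample sentence are invented for illustration and are not taken from the SiR 1.0 annotation scheme.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TextBound:
    ann_id: str                   # e.g. "T1"
    label: str                    # annotation type
    spans: List[Tuple[int, int]]  # character offsets; several if discontinuous
    text: str                     # covered text as stored in the .ann file


def parse_text_bound(ann_content: str) -> List[TextBound]:
    """Parse the text-bound ("T") lines of a Brat stand-off annotation file."""
    result = []
    for line in ann_content.splitlines():
        if not line.startswith("T"):
            continue  # relations, attributes, notes etc. are skipped in this sketch
        parts = line.split("\t", 2)
        if len(parts) < 3:
            continue  # malformed line
        ann_id, type_and_offsets, covered = parts
        label, offsets = type_and_offsets.split(" ", 1)
        spans = []
        for fragment in offsets.split(";"):  # ";" separates discontinuous fragments
            start, end = fragment.split()
            spans.append((int(start), int(end)))
        result.append(TextBound(ann_id, label, spans, covered))
    return result


# Invented example ("The minister said that the budget is ready.")
SAMPLE_TXT = "Ministr řekl, že rozpočet je hotov."
SAMPLE_ANN = "T1\tSOURCE 0 7\tMinistr\nT2\tSIGNAL 8 12\třekl\n"

annotations = parse_text_bound(SAMPLE_ANN)
```

Each parsed span can be checked against the plain text by slicing, e.g. `SAMPLE_TXT[0:7]`, which is a quick sanity check that offsets and text files are aligned.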
280. Slavic Forest, Norwegian Wood (models)
- Creator:
- Rosa, Rudolf, Zeman, Daniel, Mareček, David, and Žabokrtský, Zdeněk
- Publisher:
- Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
- Type:
- other and toolService
- Subject:
- parsing, dependency parser, cross-lingual parsing, and universal dependencies
- Language:
- Slovak, Croatian, and Norwegian
- Description:
- Trained models for UDPipe used to produce our final submission to the VarDial 2017 CLP shared task (https://bitbucket.org/hy-crossNLP/vardial2017). The SK model was trained on CS data, the HR model on SL data, and the NO model on a concatenation of DA and SV data. The scripts and commands used to create the models are part of a separate submission (http://hdl.handle.net/11234/1-1970). The models were trained with UDPipe version 3e65d69 from 3rd Jan 2017, obtained from https://github.com/ufal/udpipe; their functionality with newer or older versions of UDPipe is not guaranteed. We list here the Bash command sequences that can be used to reproduce our results submitted to VarDial 2017. The input files must be in CoNLL-U format. The models only use the form, UPOS, and Universal Features fields (SK only uses the form). You must have UDPipe installed. The feats2FEAT.py script, which prunes the universal features, is bundled with this submission.
SK -- tag and parse with the model:
udpipe --tag --parse sk-translex.v2.norm.feats07.w2v.trainonpred.udpipe sk-ud-predPoS-test.conllu
A slightly better after-deadline model (sk-translex.v2.norm.Case-feats07.w2v.trainonpred.udpipe), which we mention in the accompanying paper, is also included. It is applied in the same way:
udpipe --tag --parse sk-translex.v2.norm.Case-feats07.w2v.trainonpred.udpipe sk-ud-predPoS-test.conllu
HR -- prune the Features to keep only Case and parse with the model:
python3 feats2FEAT.py Case < hr-ud-predPoS-test.conllu | udpipe --parse hr-translex.v2.norm.Case.w2v.trainonpred.udpipe
NO -- put the UPOS annotation aside, tag Features with the model, merge with the left-aside UPOS annotation, and parse with the model (this hassle is because UDPipe cannot be told to keep UPOS and only change Features):
cut -f1-4 no-ud-predPoS-test.conllu > tmp
udpipe --tag no-translex.v2.norm.tgttagupos.srctagfeats.Case.w2v.udpipe no-ud-predPoS-test.conllu | cut -f5- | paste tmp - | sed 's/^\t$//' | udpipe --parse no-translex.v2.norm.tgttagupos.srctagfeats.Case.w2v.udpipe
- Rights:
- Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0), http://creativecommons.org/licenses/by-nc-sa/4.0/, and PUB
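The feature-pruning step used for the HR model can be illustrated as follows. This is a minimal sketch of keeping only selected feature names (for example Case) in the FEATS column of CoNLL-U input; it is not the bundled feats2FEAT.py script, whose exact behaviour may differ, and the function name `prune_feats` is an assumption made for this example.

```python
from typing import Iterable, Iterator, Set

FEATS_COLUMN = 5  # CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC


def prune_feats(lines: Iterable[str], keep: Set[str]) -> Iterator[str]:
    """Yield CoNLL-U lines whose FEATS column retains only the features in `keep`."""
    for line in lines:
        line = line.rstrip("\n")
        # Comments and blank sentence separators pass through unchanged.
        if not line or line.startswith("#"):
            yield line
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            yield line  # not a regular token line; leave it untouched
            continue
        feats = cols[FEATS_COLUMN]
        if feats != "_":
            kept = [f for f in feats.split("|") if f.split("=", 1)[0] in keep]
            # An empty feature set is written as "_" in CoNLL-U.
            cols[FEATS_COLUMN] = "|".join(kept) if kept else "_"
        yield "\t".join(cols)


# Invented two-token example.
SAMPLE = [
    "# text = vidi psa",
    "1\tvidi\tvidjeti\tVERB\t_\tNumber=Sing|Person=3\t0\troot\t_\t_",
    "2\tpsa\tpas\tNOUN\t_\tAnimacy=Anim|Case=Acc|Number=Sing\t1\tobj\t_\t_",
    "",
]
pruned = list(prune_feats(SAMPLE, {"Case"}))
```

To use it as a filter in a pipeline like the one above, the function can be applied to `sys.stdin` and its output printed line by line.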