This paper is based on a study conducted under the research grant "Institutions in Life Stories. Multilevel Comparative Analysis of Biographical Narratives of Three Groups of Participants in Czech Society in 20th Century". The research had two aims: to describe one possible way of using a corpus to identify relevant differences between three types of text (here, biographical narratives of three groups of speakers: communist officials, dissidents and so-called common people), and to serve as a basis for further linguistic, sociological or historical analysis. We tried to point out typical features of the language of each group based on the most frequent expressions (nouns, adjectives etc.) and especially collocations. We also compared the corpus Příběhy (Stories) as a whole with the ORAL2008 corpus of synchronic spoken Czech, the SYN2005 corpus of synchronic written Czech and the Totalita corpus (a corpus of communist propaganda).
Monocollocable words are words and word forms that occur in only one lexical combination, or in very few combinations whose number is severely restricted and fixed. In practice, they appear as parts of set idioms and multi-word terms. They are found in many languages, cf. English tenterhooks or Russian bakluši. The Czech examples dát/dostat najevo, na viděnou, je mi líto, říct/mluvit/hrát nahlas, je známo, je zapotřebí, být třešničkou na dortu, není divu, jít/chodit pěšky, dát/dostat zadarmo illustrate this in more detail. They show that limited variation is possible but, above all, that these are not full-fledged words, since they lack most of the basic characteristics of words, such as meaning and word-class membership. Because of their severely limited combinatorial capacity, these words, less commonly known under alternative labels such as cranberry words, form a substantial and irregular periphery of the language and its lexicon. The contribution briefly comments on some of their aspects and suggests that some broad classes or types can be recognized.
This contribution critically presents, in a brief, succinct and almost aphoristic way, a number of problems in today's corpus and computational linguistics together with their unsatisfactory solutions. At the same time, it tries to dispel a number of myths and oversimplified opinions in the field.
A balanced corpus of contemporary written Czech containing 100 million words (MW). It was created as a representation of written language from 2000–2004 and thus contains a wide range of text types and genres (fiction, professional literature, newspapers etc.) in balanced proportions. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods.
The corpus is provided in a (semi-XML) vertical format used as input to the Manatee query engine. The data thus correspond to the corpus available via the query interface to registered users of the CNC, with one important exception: they are shuffled, i.e. divided into blocks of at most 100 words (respecting sentence boundaries) whose order was randomized within each document. Project MSM0021620823: Český národní korpus a korpusy dalších jazyků (Czech National Corpus and Corpora of Other Languages).
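The shuffling described above can be illustrated with a minimal Python sketch. It is not the CNC's actual procedure: the function names, the greedy packing of sentences into blocks, and the handling of sentences longer than the limit (each gets a block of its own) are assumptions made for illustration only.

```python
import random
from typing import List, Optional

Sentence = List[str]          # one sentence = a list of word tokens
Block = List[Sentence]        # one block = consecutive whole sentences


def split_into_blocks(sentences: List[Sentence], max_words: int = 100) -> List[Block]:
    """Greedily pack consecutive sentences into blocks of at most max_words.

    Sentences are never split. Assumption: a single sentence longer than
    max_words is placed in a block of its own, which then exceeds the limit.
    """
    blocks: List[Block] = []
    current: Block = []
    current_len = 0
    for sent in sentences:
        # Close the current block if adding this sentence would overflow it.
        if current and current_len + len(sent) > max_words:
            blocks.append(current)
            current, current_len = [], 0
        current.append(sent)
        current_len += len(sent)
    if current:
        blocks.append(current)
    return blocks


def shuffle_document(sentences: List[Sentence], max_words: int = 100,
                     seed: Optional[int] = None) -> List[Block]:
    """Return the document's blocks in random order (one document only)."""
    blocks = split_into_blocks(sentences, max_words)
    rng = random.Random(seed)  # seeded generator for reproducible output
    rng.shuffle(blocks)
    return blocks
```

Because only the order of blocks changes, word order inside each sentence and sentence order inside each block are preserved, so queries over short contexts still work while the original running text of a document cannot be read off directly.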
A corpus of contemporary Czech newspapers and magazines containing 300 million words (MW). It contains various titles published between the end of 1989 and 2004. The corpus is lemmatized and morphologically tagged by a combination of stochastic and rule-based methods.
The corpus is provided in a (semi-XML) vertical format used as input to the Manatee query engine. The data thus correspond to the corpus available via the query interface to registered users of the CNC, with one important exception: they are shuffled, i.e. divided into blocks of at most 100 words (respecting sentence boundaries) whose order was randomized within each document. Project MSM0021620823: Český národní korpus a korpusy dalších jazyků (Czech National Corpus and Corpora of Other Languages).
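For readers unfamiliar with the vertical format, the following Python sketch shows one way such a file might be read. It assumes the common layout of one token per line with tab-separated attributes and structural tags on lines of their own; the column names `word`, `lemma`, `tag`, the `<s>` sentence element, and the function name `read_vertical` are assumptions for illustration, not a description of this corpus's actual attribute list.

```python
from typing import Dict, Iterable, Iterator, List, Sequence

Token = Dict[str, str]


def read_vertical(lines: Iterable[str],
                  attrs: Sequence[str] = ("word", "lemma", "tag")) -> Iterator[List[Token]]:
    """Yield sentences from vertical-format lines as lists of token dicts.

    Assumed layout: each token on its own line with tab-separated
    attributes; structures such as <doc ...>, <s>, </s> on separate lines.
    """
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        # Structural tag lines contain no tab and are wrapped in <...>.
        if "\t" not in line and line.startswith("<") and line.endswith(">"):
            if line == "</s>" and sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(dict(zip(attrs, line.split("\t"))))
    if sentence:  # trailing tokens without a closing </s>
        yield sentence


# Illustrative input; the tag values are placeholders, not real tagset codes.
sample = [
    '<doc id="example">',
    "<s>",
    "Psi\tpes\tTAG1",
    "štěkají\tštěkat\tTAG2",
    "</s>",
    "<s>",
    "Ano\tano\tTAG3",
    "</s>",
    "</doc>",
]
sentences = list(read_vertical(sample))
```

With a real file, the same generator could be fed an open file handle, for example `read_vertical(open(path, encoding="utf-8"))`, where `path` is a hypothetical local file name.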