This paper is based on a study conducted under the research grant "Institutions in Life Stories. Multilevel Comparative Analysis of Biographical Narratives of Three Groups of Participants in Czech Society in 20th Century". The research had two aims: to describe one possible way of using a corpus to identify relevant differences between three types of text (here, biographical narratives of three groups of speakers: communist officials, dissidents and so-called common people), and to serve as a basis for further linguistic, sociological or historical analysis. We tried to identify typical features of the language of each group on the basis of its most frequent expressions (nouns, adjectives, etc.) and especially its collocations. We also compared the Příběhy (Stories) corpus as a whole with the ORAL2008 corpus of synchronic spoken Czech, the SYN2005 corpus of synchronic written Czech and the Totalita corpus (a corpus of communist propaganda).
Monocollocable words are words and word forms that occur in only one lexical combination, or in very few combinations whose number is severely restricted and fixed. In practice, they occur as parts of set idioms and multi-word terms. They are found in many languages, cf. English tenterhooks or Russian bakluši. The Czech examples dát/dostat najevo, na viděnou, je mi líto, říct/mluvit/hrát nahlas, je známo, je zapotřebí, být třešničkou na dortu, není divu, jít/chodit pěšky, dát/dostat zadarmo illustrate this in more detail. They also show that limited variation may occur but, above all, that these are not full-fledged words, since they lack most of the basic characteristics of a word, such as meaning, word-class membership, etc. Given their severely limited combinatorial capacity, these words, also known, less commonly, under alternative labels such as cranberry words, form a substantial and irregular periphery of the language and its lexicon. The contribution briefly comments on some of their aspects and suggests that, broadly, some classes or types can be recognized.
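The definition of monocollocable words can be illustrated with a minimal sketch in Python. It is not the procedure used in the contribution. It assumes a toy corpus of tokenized sentences, treats only the immediately adjacent token as a collocate, and works on raw word forms without lemmatization. The function name `monocollocable_candidates` and the thresholds `min_freq` and `max_partners` are hypothetical and chosen for illustration only.

```python
from collections import Counter, defaultdict


def monocollocable_candidates(sentences, min_freq=3, max_partners=1):
    """Return words that always appear next to the same few neighbours.

    A word is a candidate if it occurs at least `min_freq` times and, on
    the left or on the right, every occurrence has a neighbour and the
    number of distinct neighbours is at most `max_partners`.
    (Illustrative heuristic only; adjacency stands in for collocation.)
    """
    freq = Counter()
    left = defaultdict(Counter)   # word -> counts of left neighbours
    right = defaultdict(Counter)  # word -> counts of right neighbours

    for tokens in sentences:
        for i, word in enumerate(tokens):
            freq[word] += 1
            if i > 0:
                left[word][tokens[i - 1]] += 1
            if i + 1 < len(tokens):
                right[word][tokens[i + 1]] += 1

    candidates = {}
    for word, n in freq.items():
        if n < min_freq:
            continue  # too rare to judge
        for side in (left, right):
            neighbours = side[word]
            # Every occurrence must have a neighbour on this side,
            # and the set of distinct neighbours must be very small.
            if sum(neighbours.values()) == n and len(neighbours) <= max_partners:
                candidates[word] = set(neighbours)
                break
    return candidates


# Toy data built around the English example from the text.
SENTENCES = [
    ["she", "was", "on", "tenterhooks", "all", "day"],
    ["we", "were", "on", "tenterhooks", "again"],
    ["he", "kept", "us", "on", "tenterhooks"],
    ["the", "book", "was", "on", "the", "table"],
    ["the", "day", "was", "long"],
]

print(monocollocable_candidates(SENTENCES))  # {'tenterhooks': {'on'}}
```

On real data, a sketch of this kind would probably need lemmatization, a wider collocation window and a much higher frequency threshold, since rare words trivially have few neighbours.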
This contribution critically presents, in a brief, succinct and almost aphoristic way, a number of problems of today’s corpus and computational linguistics together with their unsatisfactory solutions, and at the same time tries to dispel a number of myths and simplified opinions in the field.