The valency lexicon PDT-Vallex has been built in close connection with the annotation of the Prague Dependency Treebank (PDT) project and its successors (mainly the Prague Czech-English Dependency Treebank project, PCEDT). It contains over 11,000 valency frames for more than 7,000 verbs that occurred in the PDT or PCEDT. It is available in an electronically processable format (XML) together with the aforementioned treebanks (to be viewed and edited in TrEd, the main PDT/PCEDT annotation tool), and also in a more human-readable form that includes corpus examples (see the WEBSITE link below). The main feature of the lexicon is its linking to the annotated corpora: each occurrence of each verb is linked to the appropriate valency frame, with additional (generalized) information about its usage and surface morphosyntactic form alternatives.
In this article, a case study grows into a kind of manifesto, or a challenge to discuss the principles of axiology in a corpus-based grammar. Part 1 (introduction) presents some facts about a group of Czech village names. One of them was used frequently in the media last year, not always in accordance with language handbooks; Part 2 records this phenomenon. Part 3 sketches how this phenomenon would be treated in the spirit of laissez-faire linguistics. Part 4 starts with a reminder that corpora contain not only language phenomena but errors as well. Axiology is then presented as the observation of values (a) in the national language, (b) in texts, (c) in language description. A description within a badly needed axiological frame is called for and demonstrated, one in which language phenomena would be evaluated not only by mere frequencies but also according to the qualities of the source texts. Part 5 outlines a broader frame and some parallels in other disciplines where the description of human practice differs from theoretical postulates. Part 6 specifies the role this journal hopes to play in further discussions: about the use of corpora in grammar research, about criteria for marking language phenomena, about distinguishing innovations from errors, and about the values of individual language phenomena.
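One possible reading of evaluating a phenomenon "not only by mere frequencies but also according to the qualities of the source texts" is a frequency count in which each occurrence is weighted by a rating of the text it comes from. The following is only a minimal sketch under that assumption; the function name, the variant labels, the weights, and the data are all invented for illustration and are not taken from the article.

```python
def weighted_share(occurrences, variant):
    """Share of `variant` among all occurrences, weighted by source quality."""
    total = sum(weight for _, weight in occurrences)
    hits = sum(weight for form, weight in occurrences if form == variant)
    return hits / total

# Invented observations: (variant label, hypothetical quality weight of the source text).
occurrences = [
    ("codified", 1.0),
    ("codified", 1.0),
    ("noncodified", 0.5),
    ("noncodified", 0.5),
    ("noncodified", 0.5),
    ("noncodified", 0.5),
]

# Raw share counts every occurrence equally.
raw_share = sum(1 for form, _ in occurrences if form == "noncodified") / len(occurrences)
# Weighted share discounts occurrences from lower-rated sources.
quality_share = weighted_share(occurrences, "noncodified")

print(raw_share, quality_share)
```

In this toy sample the noncodified variant accounts for two thirds of the raw occurrences but only half of the weighted total. How the weights should actually be assigned is exactly the kind of question the article puts up for discussion.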
Both historical and recent developments of quantitative research in linguistics have produced a great amount of data without a unifying method. The older data were computed mainly by hand from limited samples of shorter texts, with limited possibilities for combining data. Newer data based on large corpora offer a great number of quantitative characteristics, even in the most diverse combinations, but they have mainly been extracted from heterogeneous text materials. Statistically, the older data can be considered less exact; the new data, given the enormous extent of the corpora, can be considered the most exact. Problems therefore arise not only from the above-mentioned methodological disparities between the old and new approaches to computation, but also from differences in the details studied and from the limited possibilities of direct comparison. Deeper statistical and probabilistic questions arise too, and their discussion should not be ignored.
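The contrast in exactness between small hand-counted samples and very large corpora can be illustrated with the textbook normal-approximation (Wald) interval for a relative frequency. This is a minimal sketch, not the authors' method: the sample sizes and counts are invented, and the formula assumes independent, identically distributed tokens, an assumption that heterogeneous corpus material may well violate.

```python
import math

def wald_interval(count, sample_size, z=1.96):
    """Approximate 95% interval for a relative frequency (normal approximation)."""
    p = count / sample_size
    half_width = z * math.sqrt(p * (1 - p) / sample_size)
    return p - half_width, p + half_width

# Same relative frequency (1%) in two hypothetical samples of different size.
small = wald_interval(100, 10_000)               # e.g. a hand-counted sample
large = wald_interval(1_000_000, 100_000_000)    # e.g. a large corpus

small_width = small[1] - small[0]
large_width = large[1] - large[0]

# The interval width shrinks with the square root of the sample size.
print(small_width / large_width)
```

With a sample 10,000 times larger, the interval is about 100 times narrower. That gain in nominal precision says nothing about whether the two samples are comparable, which is where the methodological questions raised above come in.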
The possibility of electronically searching very large corpora of texts has opened up ways in which we can truly evaluate the rules through which grammarians have tried, and continue to try, to simulate natural languages. However, the ability to handle incredibly large amounts of text might lead to problems with the assessment of certain phenomena that are hardly ever represented in those corpora and yet have always been regarded as grammatically correct elements of a given language. In German, typical phenomena of this kind are forms like betrögest or erwögest, i.e. the second person singular of the past subjunctive of so-called strong verbs. Should we see them merely as grammarians' inventions? Before doing so, we should reconsider the nature of these phenomena. They may appear to be isolated word forms but are in fact compact realizations of syntactic constructions, and it is the frequency of these constructions that should be evaluated, not the frequency of their specific realizations.
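The difference between counting a specific realization and counting the construction it belongs to can be sketched as follows. This is a simplified illustration only: it equates the "construction" with a morphological category, and the tag labels and the tiny token sample are invented, not drawn from any corpus.

```python
from collections import Counter

# Invented toy sample: (word form, lemma, hypothetical morphological tag).
tokens = [
    ("wäre", "sein", "PAST.SUBJ.3SG"),
    ("hätte", "haben", "PAST.SUBJ.1SG"),
    ("kämest", "kommen", "PAST.SUBJ.2SG"),
    ("gäbe", "geben", "PAST.SUBJ.3SG"),
    ("wärest", "sein", "PAST.SUBJ.2SG"),
    ("ging", "gehen", "PAST.IND.3SG"),
]

# Frequency of individual word forms.
form_counts = Counter(form for form, _, _ in tokens)
# Frequency of the grammatical category the forms realize.
construction_counts = Counter(tag for _, _, tag in tokens)

print(form_counts["betrögest"])               # 0: the specific form is unattested
print(construction_counts["PAST.SUBJ.2SG"])   # 2: the category itself is attested
```

A form such as betrögest can have zero hits while the category it instantiates is attested through other verbs, which is the distinction the argument above relies on.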