Subject: lemmatizace - LINDAT/CLARIAH-CZ Catalog Search Results

1. Korpus ORAL: sestavení, lemmatizace a morfologické značkování [The ORAL corpus: compilation, lemmatization, and morphological tagging]

Creator:: Kopřivová, Marie, Komrsková, Zuzana, Lukeš, David, and Poukarová, Petra
Format:: bez média (unmediated) and svazek (volume)
Type:: model:article and TEXT
Subject:: spoken Czech, spoken language corpora, lemmatization, tagging, morphological analysis, mluvená čeština, korpusy mluveného jazyka, lemmatizace, tagování, and morfologická analýza
Language:: Czech
Description:: The goal of this paper is to provide an overview of the structure and contents of the soon-to-be available ORAL corpus, which combines previously published corpora (ORAL2006, ORAL2008 and ORAL2013) with newly transcribed material into a single conveniently accessible and more richly annotated resource, about 6 million running words in length. The recordings and corresponding transcripts span a decade between 2002 and 2011; most of them capture interactions of mutually well-acquainted speakers, in informal situations and natural settings. The corpus is complemented by a marginal portion of more formal data, mostly public talks. It is tagged and lemmatized, and an effort was made to adapt existing tools (targeted at written language) to yield better results on spoken data. We hope the availability of such a resource will spawn further discussions on the morphological and syntactic analysis of spoken language, perhaps resulting in more radical departures in the future from the part-of-speech classification inherited from the linguistic analysis of written language.
Rights:: http://creativecommons.org/publicdomain/mark/1.0/ and policy:public

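2. Změny v morfologické anotaci korpusů řady SYN: nové možnosti zkoumání české gramatiky a lexikonu [Changes in the morphological annotation of the SYN series corpora: new possibilities for researching Czech grammar and lexicon]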

Creator:: Křivan, Jan and Šindlerová, Jana
Format:: bez média (unmediated) and svazek (volume)
Type:: model:article and TEXT
Subject:: lemmatization, tokenization, morphological annotation, verbal morphology, lemma variants, lemmatizace, tokenizace, morfologická anotace, slovesná morfologie, and varianty lemmatu
Language:: Czech
Description:: This paper introduces some major conceptual enhancements to the morphological annotation of the SYN series corpora of the Czech National Corpus. Apart from minor changes in tokenization and in the positional tagset, three major conceptual changes have been applied which affect the representation of various lexical and grammatical patterns. In the paper, we present the actual impact of the changes in linguistic data and search for possibilities in three linguistic areas. First, the treatment of phonic, graphemic, and morphological variants via a two-tier lemma structure is discussed; second, a new approach to periphrastic verb forms, auxiliaries, participles and the interpretation of verbal grammatical categories through a new attribute, called verbtag, is explained; and third, a complex multi-value treatment of multiword tokens is introduced.
Rights:: http://creativecommons.org/licenses/by-nc-sa/4.0/ and policy:public
