Subject: text categorization - LINDAT/CLARIAH-CZ Catalog Search Results

Creator:: Gadri, S. and Moussaoui, A.
Format:: unmediated and volume
Type:: model:article and TEXT
Subject:: N-grams, language identification, text categorization, text mining, machine learning, Kullback-Leibler distance, χ² distance, and Cavnar-Trenkle distance
Language:: English
Description:: Automatic text classification is an important task that consists of assigning labels (categories, groups, classes) to a given text based on a set of previously labeled texts called the training set. The work presented in this paper addresses the problem of automatic topical text categorization. It is supervised because it works on a predefined set of classes, and topical because it uses the topics or subjects of texts as classes. In this context, we used a new approach based on the $k$-NN algorithm, together with a new set of pseudo-distances (distance metrics) known in the field of language identification. We also proposed a simple and effective method to improve the quality of the resulting categorization.
Rights:: http://creativecommons.org/publicdomain/mark/1.0/ and policy:public
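
To make the idea in the description above concrete, here is a minimal sketch of $k$-NN text categorization using a rank-based "out-of-place" distance over character N-gram profiles, in the style of the Cavnar-Trenkle distance listed in the subject terms. It is an illustration only, not the authors' implementation: the function names (`ngram_profile`, `out_of_place`, `knn_classify`), the profile size, and the toy training data are all assumptions made for this example.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top=300):
    """Rank the most frequent character n-grams (lengths 1..n_max) of a text."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    counts = Counter(
        text[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(text) - n + 1)
    )
    # Sort by descending count, then alphabetically, so ranks are deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:top]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, ref_profile):
    """Rank-displacement distance; n-grams missing from ref get a fixed penalty."""
    max_penalty = len(ref_profile)  # assumed penalty for an unseen n-gram
    return sum(
        abs(rank - ref_profile[gram]) if gram in ref_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def knn_classify(text, training_set, k=3):
    """Label a text by majority vote among its k nearest labeled texts."""
    profile = ngram_profile(text)
    nearest = sorted(
        training_set,
        key=lambda item: out_of_place(profile, ngram_profile(item[0])),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy training set of (text, label) pairs.
TRAINING = [
    ("the stock market fell sharply today", "finance"),
    ("the team won the football match", "sport"),
]
```

Calling `knn_classify(some_text, TRAINING, k=1)` returns the label of the closest training text. Other pseudo-distances, such as Kullback-Leibler or χ², could be swapped in for `out_of_place` by comparing N-gram frequency distributions instead of ranks.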

Creator:: Barigou, F.
Format:: unmediated and volume
Type:: model:article and TEXT
Subject:: text categorization, k-nearest neighbors, cellular automaton, and efficiency
Language:: English
Description:: With the increasing use of the Internet and electronic documents, automatic text categorization has become imperative. Many classification methods have been applied to text categorization. The k-nearest neighbors (k-NN) classifier is known to be one of the best state-of-the-art classifiers for text categorization. However, k-NN suffers from limitations such as high computational cost, low tolerance to noise, and dependence on the parameter k and the distance function. In this paper, we first survey some improved algorithms proposed in the literature to address those shortcomings. Second, we discuss an approach to improve k-NN efficiency without degrading classification performance. Experimental results on the 20 Newsgroups and Reuters corpora show that the proposed approach increases the performance of k-NN and reduces classification time.
Rights:: http://creativecommons.org/publicdomain/mark/1.0/ and policy:public

Creator:: Montejo-Ráez, Arturo
Publisher:: European Organization for Nuclear Research (CERN) and University of Jaén (Spain)
Type:: toolService
Subject:: text categorization
Description:: TECAT is a command-line tool for multi-label text categorization and evaluation. It can combine multiple base binary classifiers (both built-in and external).
Rights:: Not specified
