Automatic text classification is an important task that consists of assigning labels (categories, groups, classes) to a given text based on a set of previously labeled texts called the training set. This paper addresses the problem of automatic topical text categorization. The classification is supervised because it works on a predefined set of classes, and topical because it uses the topics or subjects of texts as classes. In this context, we use a new approach based on the $k$-NN algorithm, together with a new set of pseudo-distances (distance metrics) known in the field of language identification. We also propose a simple and effective method to improve the quality of the resulting categorization.
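The abstract does not name the pseudo-distances it uses. As a rough illustration only, the sketch below pairs $k$-NN with the rank-based "out-of-place" measure that is well known from n-gram language identification. The function names, the trigram size, and the profile length are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Map each of the `top` most frequent character n-grams to its rank."""
    text = " ".join(text.lower().split())
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    # Sort by descending frequency, then alphabetically so ranks are deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:top]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(profile_a, profile_b):
    """Sum of rank differences over profile_a's n-grams.

    N-grams absent from profile_b receive a fixed maximum penalty.
    The measure is not symmetric in general, hence a pseudo-distance.
    """
    max_penalty = len(profile_b)
    return sum(
        abs(rank - profile_b[gram]) if gram in profile_b else max_penalty
        for gram, rank in profile_a.items()
    )

def knn_classify(text, training_set, k=3):
    """Label `text` by majority vote among its k nearest training texts.

    `training_set` is a list of (text, label) pairs (an assumed layout).
    """
    query = ngram_profile(text)
    neighbors = sorted(
        training_set,
        key=lambda item: out_of_place(query, ngram_profile(item[0])),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

In practice the training profiles would be computed once and cached rather than rebuilt for every query; they are recomputed here only to keep the sketch short.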
With the increasing use of the Internet and electronic documents, automatic text categorization has become imperative. Many classification methods have been applied to text categorization. The k-nearest neighbors (k-NN) classifier is known to be one of the best state-of-the-art classifiers for text categorization. However, k-NN suffers from limitations such as high computational cost, low tolerance to noise, and dependence on the parameter k and the distance function. In this paper, we first survey some improved algorithms proposed in the literature to address these shortcomings. Second, we discuss an approach to improve k-NN efficiency without degrading classification performance. Experimental results on the 20 Newsgroups and Reuters corpora show that the proposed approach increases the performance of k-NN and reduces classification time.
TECAT is a command-line tool for multi-label text categorization and evaluation. It can combine multiple base binary classifiers (both built-in and external).
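To make the idea of combining base binary classifiers concrete, the following sketch shows one common scheme: each label has several binary classifiers, and a label is assigned when a strict majority of them fire. This is not TECAT's interface or its actual combination rule, which the description above does not specify. The keyword-based classifiers, the function names, and the `ensemble` layout are all hypothetical.

```python
def keyword_classifier(keywords):
    """Hypothetical base binary classifier: fires if any keyword occurs in the text."""
    lowered = [k.lower() for k in keywords]

    def classify(text):
        t = text.lower()
        return any(k in t for k in lowered)

    return classify

def majority_vote(classifiers, text):
    """Return True when a strict majority of the binary classifiers fire."""
    votes = sum(1 for clf in classifiers if clf(text))
    return votes * 2 > len(classifiers)

def predict_labels(text, ensemble):
    """Assign every label whose group of binary classifiers votes yes."""
    return sorted(
        label for label, clfs in ensemble.items() if majority_vote(clfs, text)
    )

# Assumed toy setup: three base classifiers per label.
ensemble = {
    "sports": [
        keyword_classifier(["match", "goal"]),
        keyword_classifier(["team"]),
        keyword_classifier(["league", "score"]),
    ],
    "finance": [
        keyword_classifier(["stock", "market"]),
        keyword_classifier(["bank"]),
        keyword_classifier(["shares", "profit"]),
    ],
}

labels = predict_labels(
    "The team scored a late goal as club shares rose on the stock market",
    ensemble,
)
print(labels)  # ['finance', 'sports']
```

Because each label is decided independently, a text can receive zero, one, or several labels, which is what distinguishes multi-label categorization from single-label classification.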