The paper presents the results of a GUHA analysis of proteomic data. The data come from an oncological study of breast cancer and consist of 2D electrophoresis gels that record the expression intensity of proteins in cancer cells. A physician classified the gels according to the clinical course of the tumor disease. The research task was to search for significant relations between protein spot intensities and the corresponding clinical presentation. The task was solved using the GUHA data mining method.
Most traditional clustering algorithms perform poorly on structures more complex than convex, spherical sample spaces. In the past few years, several spectral clustering algorithms have been proposed to cluster arbitrarily shaped data in various real applications. However, spectral clustering relies on the clusters in the dataset being reasonably well separated. When a cluster has an obvious inflection point within a non-convex space, the spectral clustering algorithm can mistakenly split that one cluster into several. In this paper, we propose a novel spectral clustering algorithm called HSC, which is combined with a hierarchical method and avoids this disadvantage of spectral clustering by not using the misleading information from noisy neighboring data points. A simple clustering procedure is applied to eliminate the misleading information, so the HSC algorithm can cluster both convex-shaped and arbitrarily shaped data more efficiently and accurately. Experiments on both synthetic and real data sets show that HSC outperforms other popular clustering algorithms. Furthermore, we observed that HSC can also be used to estimate the number of clusters.
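For orientation, the baseline that this abstract builds on can be sketched as follows. This is a minimal illustration of standard spectral bipartitioning (Gaussian affinity, symmetric normalized Laplacian, sign of the second eigenvector), not the HSC algorithm, whose details the abstract does not give. The function names, the bandwidth `sigma`, and the two-ring toy data are assumptions made for this sketch only.

```python
import numpy as np

def spectral_bipartition(points, sigma=0.5):
    """Split points into two groups by the sign of the Fiedler vector.

    Standard spectral bipartitioning, shown as a baseline only.
    sigma is an assumed Gaussian bandwidth, not a value from the paper.
    """
    X = np.asarray(points, dtype=float)
    # Pairwise squared Euclidean distances.
    diff = X[:, None, :] - X[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    # Gaussian affinity matrix without self-loops.
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order, so column 1 holds the
    # eigenvector of the second-smallest eigenvalue (the Fiedler vector).
    _, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]
    return (fiedler > 0).astype(int)

def ring(radius, n):
    """Return n evenly spaced points on a circle (toy data for the sketch)."""
    angles = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    return np.column_stack([radius * np.cos(angles), radius * np.sin(angles)])

# Two concentric rings: a non-convex layout that centroid-based methods
# typically cannot separate.
points = np.vstack([ring(1.0, 40), ring(3.0, 40)])
labels = spectral_bipartition(points)
```

In this toy layout the rings are well separated relative to `sigma`, which is the favorable case the abstract describes. The failure mode it mentions, a single cluster being split at an inflection point, is not reproduced here.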
Matrix factorization, or factor analysis, is an important task that helps in the analysis of high-dimensional real-world data. There are several well-known methods and algorithms for the factorization of real-valued data, but many application areas, including information retrieval, pattern recognition, and data mining, require processing binary rather than real-valued data. Unfortunately, the methods used for real matrix factorization fail in the latter case. In this paper we introduce the background and an initial version of a genetic algorithm for binary matrix factorization.
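To make the binary setting concrete, the sketch below computes a Boolean matrix product and counts the entries in which it differs from a target matrix. A count of this kind is one plausible fitness measure for a genetic algorithm. The abstract does not state the fitness function actually used, so the function names and the example matrices here are assumptions for illustration.

```python
def boolean_product(A, B):
    """Boolean matrix product: C[i][j] = OR over k of (A[i][k] AND B[k][j])."""
    return [
        [int(any(a and b for a, b in zip(row, col))) for col in zip(*B)]
        for row in A
    ]

def reconstruction_error(X, A, B):
    """Count the entries where X differs from the Boolean product of A and B."""
    P = boolean_product(A, B)
    return sum(x != p for row_x, row_p in zip(X, P) for x, p in zip(row_x, row_p))

# Hypothetical 3x3 binary data matrix and a candidate rank-2 factorization.
X = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
A = [[1, 0],
     [1, 1],
     [0, 1]]
B = [[1, 1, 0],
     [0, 1, 1]]

error = reconstruction_error(X, A, B)
```

Here the Boolean product of `A` and `B` reproduces `X` exactly. Under ordinary arithmetic the middle entry of the product would be 2 rather than 1, which hints at why real-valued factorization methods do not carry over directly to binary data.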
This paper is a contribution to the theoretical foundations of data mining. More precisely, we contribute to the part of data mining that searches for associations among attributes that can be expressed as natural language sentences. The theoretical background and a method for mining such associations were published recently in [V. Novák et al., Mining pure linguistic associations from numerical data, Int. Journal of Approximate Reasoning 48 (2008), 4-22]. We have elaborated other mathematical representations of the model presented in that paper in order to extend its applicability.