A single-step information-theoretic algorithm that can identify possible clusters in a dataset is presented. The proposed algorithm represents data scatter in terms of similarity-based data point entropy and probability descriptions. Using these quantities, an information-theoretic association metric between data points, called mutual ambiguity, is defined and then employed to determine particular data points called cluster identifiers. To form the individual clusters corresponding to these cluster identifiers, a cluster relevance rule is defined. Since cluster identifiers and their associated cluster members can be identified without recursive or iterative search, the algorithm is single-step. The algorithm is tested and justified by experiments on synthetic and anonymous real datasets. Simulation results demonstrate that the proposed algorithm also exhibits more reliable performance in a statistical sense than major algorithms.
We investigate solutions provided by the finite-context predictive model called the neural prediction machine (NPM), built on the recurrent layer of two types of recurrent neural networks (RNNs). One type is the first-order Elman simple recurrent network (SRN), trained for next-symbol prediction by the extended Kalman filter (EKF) technique. The other type is an interesting unsupervised counterpart to the “classical” SRN: a recurrent version of the Bienenstock, Cooper, Munro (BCM) network that performs a kind of time-conditional projection pursuit. As experimental data we chose a complex symbolic sequence with both long and short memory structures. We compared the solutions achieved by both types of RNNs with Markov models to find out whether training can improve the initial solutions reached by random network dynamics, which can be interpreted as an iterated function system (IFS). The results of our simulations indicate that the SRN trained by EKF achieves better next-symbol prediction than its unsupervised counterpart. The recurrent BCM network can provide only a Markovian solution, which is not able to cover long memory structures in the sequence and thus cannot beat the SRN.
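The Markov-model baseline mentioned above can be illustrated with a minimal sketch of a fixed-order, finite-context next-symbol predictor. This is not the NPM, the SRN or the BCM network; the function names, the toy sequence and the Laplace smoothing are assumptions made only for illustration.

```python
from collections import Counter, defaultdict

def train_markov(sequence, order):
    """Count which symbol follows each context of `order` symbols."""
    counts = defaultdict(Counter)
    for i in range(order, len(sequence)):
        context = sequence[i - order:i]
        counts[context][sequence[i]] += 1
    return counts

def predict_next(counts, context, alphabet):
    """Return Laplace-smoothed next-symbol probabilities for a context."""
    seen = counts.get(context, Counter())
    # One pseudo-count per symbol keeps unseen symbols at non-zero probability.
    total = sum(seen.values()) + len(alphabet)
    return {s: (seen[s] + 1) / total for s in alphabet}

# Toy sequence (hypothetical): "ab" is followed by "c" three times, by "d" once.
sequence = "abcabcabcabd"
model = train_markov(sequence, order=2)
probs = predict_next(model, "ab", alphabet="abcd")
```

A model of this kind only sees the last `order` symbols, which is why it cannot capture long memory structures in a sequence.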
Probabilistic mixtures provide a flexible "universal" approximation of probability density functions. Their wide use is enabled by the availability of a range of efficient estimation algorithms. Among them, quasi-Bayesian estimation plays a prominent role as it runs "naturally" in one-pass mode. This is important in on-line applications and/or for extensive databases. It even copes with the dynamic nature of the components forming the mixture. However, quasi-Bayesian estimation relies on mixing via constant component weights. Thus, mixtures with dynamic components and dynamic transitions between them are not supported. The present paper fills this gap. For the sake of simplicity and to give better insight into the task, the paper considers mixtures with known components. The general case with unknown components will be presented soon.
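For orientation, the constant-weight case that the abstract takes as its starting point can be sketched as a one-pass quasi-Bayes update of Dirichlet statistics for a mixture with known components. This is a sketch of the commonly described baseline, not of the paper's extension to dynamic transitions; the Gaussian components, the data and all names are assumptions.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def quasi_bayes_weights(data, components, prior=1.0):
    """One pass over the data; returns estimates of constant component weights."""
    kappa = [prior] * len(components)  # Dirichlet statistics, one per component
    for x in data:
        total = sum(kappa)
        # Current weight estimate times the known component density.
        joint = [k / total * gaussian_pdf(x, m, s)
                 for k, (m, s) in zip(kappa, components)]
        norm = sum(joint)
        # Each statistic grows by the probability that its component produced x.
        for c, j in enumerate(joint):
            kappa[c] += j / norm
    total = sum(kappa)
    return [k / total for k in kappa]

# Hypothetical known components (mean, std) and a small data stream.
components = [(-5.0, 1.0), (5.0, 1.0)]
data = [-5.2, -4.9, -5.1, 5.0, -4.8, -5.3, 4.9, -5.0]
weights = quasi_bayes_weights(data, components)
```

Each observation is processed once and then discarded, which is what makes the scheme suitable for on-line use.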
Semi-dry grasslands are of high nature conservation interest both at national and European scales due to their high biodiversity and species richness. For effective conservation, however, the variation in floristic composition and distribution of these grasslands first needs to be described. In Hungary, there is currently no comprehensive survey and classification of semi-dry grasslands. Therefore, the aim of this study was to (i) describe the variation in species composition of Hungarian semi-dry grasslands by a country-scale cluster analysis based on a large database; (ii) describe the types (clusters) and compare these descriptions with those in the phytosociological literature, and finally (iii) formulate a new syntaxonomical system for Hungarian semi-dry grasslands. For this analysis 699 relevés were selected in which the percentage cover of at least one of the grasses Brachypodium pinnatum, Bromus erectus, Danthonia alpina, Avenula adsurgens, A. pubescens or A. compressa reached >10%. A geographical stratification of the dataset was performed and then it was split randomly into two equal parts (training and test datasets). Following outlier exclusion and noise elimination, clustering was performed separately for both datasets. The optimal number of clusters was determined by validation. The number of valid clusters was the highest at the level of ten clusters, where seven clusters appeared to be valid. The valid clusters are separated geographically; however, there are considerable overlaps in the species compositions. According to our results, all the grasslands belong to the Cirsio-Brachypodion alliance. The seven valid clusters are assigned to five main groups of semi-dry grasslands in Hungary: 1. Brachypodium pinnatum (and partly Bromus erectus) dominated, species-rich meadow-steppe-like grasslands occurring on deep loess in central Pannonia, identified as Euphorbio pannonicae-Brachypodietum Horváth 2009; 2. 
Brachypodium pinnatum dominated mountain grasslands restricted to the Bükk Mountains, identified as Polygalo majoris-Brachypodietum Wagner 1941; 3. mostly Bromus erectus dominated grasslands on shallow, calcium-rich soils of the Dunántúl region, proposed as a new association Sanguisorbo minoris-Brometum erecti Illyés, Bauer & Botta-Dukát 2009; 4. Brachypodium pinnatum and Danthonia alpina dominated stands occurring mainly in the Északi-középhegység Mts, characterized by species of nutrient-poor soils, proposed as a new association Trifolio medii-Brachypodietum pinnati Illyés, Bauer & Botta-Dukát 2009; 5. transition towards meadows and successional stands dominated mainly by Brachypodium pinnatum.
Cluster analysis and formal concept analysis are both used to identify significant groups of similar objects. Rice & Siff's algorithm joins these two methods for a two-valued object-attribute (O-A) model and often significantly reduces the number of concepts and the complexity. We consider an O-A model with graded degrees of attributes. We define a new type of one-sided fuzzification of a concept lattice. We generalize Rice & Siff's algorithm for this case with respect to a fixed metric. We prove the basic properties of this new lattice, metric and algorithm and discuss them on a real example.
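To make the two-valued starting point concrete, the following minimal sketch enumerates the formal concepts of a small binary object-attribute context by brute force. It illustrates plain formal concept analysis only, not Rice & Siff's algorithm or the one-sided fuzzification; the context and all names are hypothetical.

```python
from itertools import combinations

def formal_concepts(context):
    """Enumerate all (extent, intent) pairs of a binary object-attribute context."""
    objects = set(context)
    attributes = set().union(*context.values())

    def extent(attrs):
        # Objects that have every attribute in attrs.
        return frozenset(o for o in objects if attrs <= context[o])

    def intent(objs):
        # Attributes shared by every object in objs.
        result = set(attributes)
        for o in objs:
            result &= context[o]
        return frozenset(result)

    concepts = set()
    # Closing every attribute subset yields every concept (exponential, toy use only).
    for r in range(len(attributes) + 1):
        for attrs in combinations(sorted(attributes), r):
            ext = extent(set(attrs))
            concepts.add((ext, intent(ext)))
    return concepts

# Hypothetical context: each object maps to the set of attributes it has.
context = {
    "o1": {"a", "b"},
    "o2": {"b", "c"},
    "o3": {"b"},
}
concepts = formal_concepts(context)
```

Even this three-object context produces four concepts, which hints at why reducing the number of concepts matters for larger data.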
The present study devises two techniques for visualizing clusterings of biological sequence data. The Sequence Data Density Display (SDDD) and Sequence Likelihood Projection (SLP) visualizations represent the input symbolic sequences in a lower-dimensional space in such a way that the clusters and relations of data elements are preserved as faithfully as possible. The resulting unified framework directly incorporates raw symbolic sequence data (without requiring any preprocessing stage) and, moreover, operates on a purely unsupervised basis in the complete absence of prior information and domain knowledge.
In this article we present a novel method for mobile phone positioning using a vector space model, suffix trees and an information retrieval approach. The algorithm is based on a database of previous measurements, which serves as an index that is searched for the nearest neighbor of the query measurement. The accuracy of the algorithm is, in most cases, good enough to meet the requirements of the E9-1-1 standard on the tested data. In addition, we examine the clusters of patterns created from the measured data and project them onto the map. We use Self-Organizing Maps for these purposes.
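The nearest-neighbor lookup described above can be sketched as follows, assuming that each measurement is a vector of signal strengths and that similarity is measured by the cosine of the angle between vectors. The suffix-tree index is omitted, and the data layout, the similarity measure and all names are assumptions rather than the article's actual design.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def locate(query, database):
    """Return the stored position whose measurement is most similar to the query."""
    best = max(database, key=lambda entry: cosine_similarity(query, entry[0]))
    return best[1]

# Hypothetical database: (signal strengths from three base stations, (x, y) position).
database = [
    ([0.9, 0.1, 0.0], (10.0, 20.0)),
    ([0.1, 0.8, 0.2], (35.0, 5.0)),
    ([0.0, 0.2, 0.9], (50.0, 40.0)),
]
position = locate([0.2, 0.7, 0.1], database)
```

A linear scan like this is only practical for small databases; an index structure would replace the `max` over all entries.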
Most traditional clustering algorithms perform poorly on structures more complex than convex, spherical clusters. In the past few years, several spectral clustering algorithms have been proposed to cluster arbitrarily shaped data in various real applications. However, spectral clustering relies on each cluster in the dataset being reasonably well separated. When a cluster has an obvious inflection point within a non-convex space, the spectral clustering algorithm may mistakenly split one cluster into several. In this paper, we propose a novel spectral clustering algorithm called HSC, combined with a hierarchical method, which avoids this disadvantage of spectral clustering by not using the misleading information of noisy neighboring data points. A simple clustering procedure is applied to eliminate the misleading information, and thus the HSC algorithm can cluster both convex-shaped data and arbitrarily shaped data more efficiently and accurately. Experiments on both synthetic and real data sets show that HSC outperforms other popular clustering algorithms. Furthermore, we observed that HSC can also be used to estimate the number of clusters.
Cluster analysis and formal concept analysis are both used to identify significant groups of similar objects. Rice & Siff's clustering algorithm joins these two methods in the case where the values of an object-attribute model are 1 or 0, and it often reduces the number of concepts. We use a certain type of fuzzification of a concept lattice to generalize this clustering algorithm to the fuzzy case. To find dependencies between the objects in the clusters, we use our method of induction of generalized annotated programs, based on multiple use of crisp inductive logic programming. Since our model contains fuzzy data, it should work with fuzzy background knowledge and a fuzzy set of examples, which are not divided clearly into positive and negative classes; instead, there is a monotone hierarchy (degree, preference) of more or less positive or negative examples. We have made experiments on data describing the business competitiveness of Slovak companies.
One of the features involved in clustering is the evaluation of distances between individuals. This paper is concerned with the use of different mixed metrics for clustering messy data. Indeed, in real complex domains it becomes natural to deal with both numerical and symbolic attributes. This can be treated by different approaches; here, the use of mixed metrics is followed. In the paper, the impact of the metrics on the final classes is studied. The application relates to clustering municipalities of the metropolitan area of Barcelona on the basis of their construction behavior, the number of buildings of different types being constructed, or the political orientation of the local government. The importance of the reporting phase is also addressed in this work. Both clustering with several distances and the interpretation-oriented tools are provided by software specially designed to support Knowledge Discovery in real complex domains, called KLASS.
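As an illustration of what a mixed metric does, the sketch below averages a range-normalized absolute difference for numerical attributes and a simple mismatch indicator for symbolic ones, in the style of Gower's coefficient. This is an assumed stand-in, not the metrics implemented in KLASS; the attribute names and values are hypothetical.

```python
def mixed_distance(x, y, numeric_ranges):
    """Average per-attribute distance over numerical and symbolic attributes."""
    total = 0.0
    for key in x:
        if key in numeric_ranges:
            # Numerical attribute: absolute difference scaled by the attribute range.
            span = numeric_ranges[key]
            total += abs(x[key] - y[key]) / span if span else 0.0
        else:
            # Symbolic attribute: 0 if the categories match, 1 otherwise.
            total += 0.0 if x[key] == y[key] else 1.0
    return total / len(x)

# Hypothetical municipalities described by one numerical and one symbolic attribute.
a = {"new_buildings": 120, "government": "left"}
b = {"new_buildings": 20, "government": "right"}
c = {"new_buildings": 110, "government": "left"}
ranges = {"new_buildings": 200}
d_ab = mixed_distance(a, b, ranges)
d_ac = mixed_distance(a, c, ranges)
```

How the numerical and symbolic parts are weighted against each other changes which individuals end up close together, which is the kind of effect on the final classes that the paper studies.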