We present a new Generalized Learning Vector Quantization classifier, called Optimally Generalized Learning Vector Quantization, based on a novel weight-update rule for learning labeled samples. The algorithm attains stable prototype/weight-vector dynamics expressed in terms of the estimated current and previous weights and their updates. The resulting weight-update term is then related to the proximity measure used by Generalized Learning Vector Quantization classifiers. The new algorithm and several major counterparts are tested and compared on synthetic and publicly available datasets. For both datasets studied, the new classifier outperforms its counterparts in training and testing, with accuracy above 80%, and shows greater robustness against variation of the model parameters.
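As a point of reference for the update rule mentioned above, the following is a minimal sketch of the standard GLVQ prototype update that OGLVQ builds on; the squared Euclidean proximity measure, the sigmoid transfer function and the toy usage data are assumptions made for illustration, not details taken from the abstract.

import numpy as np

def glvq_update(x, y, prototypes, labels, lr=0.05):
    """One GLVQ step: pull the closest correct prototype towards x,
    push the closest incorrect prototype away from x (illustrative sketch)."""
    d = np.sum((prototypes - x) ** 2, axis=1)          # assumed squared Euclidean proximity
    correct = labels == y
    j = np.where(correct)[0][np.argmin(d[correct])]    # closest prototype with the same label
    k = np.where(~correct)[0][np.argmin(d[~correct])]  # closest prototype with a different label
    dj, dk = d[j], d[k]
    mu = (dj - dk) / (dj + dk)                         # relative distance difference
    f_prime = np.exp(-mu) / (1.0 + np.exp(-mu)) ** 2   # derivative of the sigmoid transfer function
    prototypes[j] += lr * f_prime * (dk / (dj + dk) ** 2) * (x - prototypes[j])
    prototypes[k] -= lr * f_prime * (dj / (dj + dk) ** 2) * (x - prototypes[k])
    return prototypes

# toy usage: two classes, one prototype per class
X = np.array([[0.0, 0.0], [1.0, 1.0]]); y = np.array([0, 1])
W = np.array([[0.2, 0.1], [0.8, 0.9]]); L = np.array([0, 1])
for xi, yi in zip(X, y):
    W = glvq_update(xi, yi, W, L)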
Time series forecasting, such as stock price prediction, is one of the most challenging problems in finance, since the data are non-stationary and noisy and are affected by many factors. This study applies a hybrid of a Genetic Algorithm (GA) and an Artificial Neural Network (ANN) to develop a method for predicting stock prices and time series. In the GA stage, the output values are fed to a dedicated ANN that corrects the remaining point-wise errors. The analysis suggests that combining the GA and the ANN increases accuracy in fewer iterations. The analysis is conducted on the 200-day main index as well as on five companies listed on the NASDAQ. Applied to the Apple stock dataset, the proposed hybrid of GA and Back Propagation (BP) achieves a 99.99% improvement in SSE and a 90.66% improvement in run time compared with traditional methods. These results demonstrate both the speed and the accuracy of the proposed approach.
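A minimal sketch of the GA-then-BP idea described above, assuming a small feed-forward network whose initial weights are evolved by a genetic algorithm (fitness = SSE on the training window) and then refined by back-propagation; the network size, GA settings and synthetic price series are illustrative stand-ins, not the study's configuration.

import numpy as np

rng = np.random.default_rng(0)
prices = np.cumsum(rng.normal(0, 1, 200)) + 100            # synthetic stand-in for a 200-day index
X = np.array([prices[i:i + 5] for i in range(190)])         # 5-day input windows
y = prices[5:195]                                            # next-day price
X = (X - X.mean()) / X.std()
y = (y - y.mean()) / y.std()

def unpack(w):
    return w[:25].reshape(5, 5), w[25:30], w[30:35], w[35]

def sse(w):
    W1, b1, W2, b2 = unpack(w)
    pred = np.tanh(X @ W1 + b1) @ W2 + b2
    return np.sum((pred - y) ** 2)

# GA phase: evolve candidate weight vectors, fitness = SSE on the training window
pop = rng.normal(0, 0.5, (40, 36))
for _ in range(100):
    pop = pop[np.argsort([sse(w) for w in pop])][:20]        # truncation selection (elitism)
    children = pop[rng.integers(0, 20, 20)] + rng.normal(0, 0.1, (20, 36))  # mutation
    pop = np.vstack([pop, children])

w = min(pop, key=sse).copy()

# BP phase: plain gradient descent refines the GA's best individual
for _ in range(300):
    W1, b1, W2, b2 = unpack(w)
    h = np.tanh(X @ W1 + b1)
    err = 2 * (h @ W2 + b2 - y)                              # dSSE / dprediction
    dh = np.outer(err, W2) * (1 - h ** 2)
    grad = np.concatenate([(X.T @ dh).ravel(), dh.sum(0), h.T @ err, [err.sum()]])
    w -= 1e-4 * grad
print("SSE after GA + BP:", sse(w))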
An Electronic Performance Support System (EPSS) introduces challenges in contextualized and personalized information delivery. Recommender systems aim at delivering and suggesting relevant information according to users' preferences, so EPSSs could take advantage of recommendation algorithms that guide users through a large space of possible options. The JUMP project (JUst-in-tiMe Performance support system for dynamic organizations, co-funded by POR Puglia 2000-2006 - Mis. 3.13, Sostegno agli Investimenti in Ricerca Industriale, Sviluppo Precompetitivo e Trasferimento Tecnologico) aims at integrating an EPSS with a hybrid recommender system.
Collaborative and content-based filtering are the recommendation techniques most widely adopted to date. The main contribution of this paper is a content-collaborative hybrid recommender that computes similarities between users from their content-based profiles, in which user preferences are stored, instead of comparing their rating styles. A distinctive feature of our system is that a statistical model of the user's interests is obtained by machine learning techniques integrated with linguistic knowledge contained in WordNet. This model, named ``semantic user profile'', is exploited by the hybrid recommender in the neighborhood formation process.
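A minimal sketch of the neighborhood formation step described above, in which users are compared through their content-based profiles rather than their rating vectors; the keyword-weight dictionaries below merely stand in for the WordNet-based semantic user profiles, and the cosine measure is an assumption.

import math

# toy content-based profiles (stand-ins for semantic user profiles)
profiles = {
    "alice": {"jazz": 0.9, "guitar": 0.7, "festival": 0.2},
    "bob":   {"jazz": 0.8, "piano": 0.6},
    "carol": {"football": 0.9, "festival": 0.4},
}

def cosine(p, q):
    """Cosine similarity between two sparse keyword-weight profiles."""
    common = set(p) & set(q)
    num = sum(p[t] * q[t] for t in common)
    den = (math.sqrt(sum(v * v for v in p.values()))
           * math.sqrt(sum(v * v for v in q.values())))
    return num / den if den else 0.0

def neighbourhood(user, k=1):
    """Rank the other users by profile similarity and keep the top k."""
    sims = [(other, cosine(profiles[user], profiles[other]))
            for other in profiles if other != user]
    return sorted(sims, key=lambda s: s[1], reverse=True)[:k]

print(neighbourhood("alice", k=2))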
Automatic text classification is a very important task that consists in assigning labels (categories, groups, classes) to a given text on the basis of a set of previously labeled texts called the training set. The work presented in this paper addresses the problem of automatic topical text categorization. The classification is supervised because it works on a predefined set of classes, and topical because it uses the topics or subjects of texts as classes. In this context, we use a new approach based on the $k$-NN algorithm together with a new set of pseudo-distances (distance metrics) known from the field of language identification. We also propose a simple and effective method to improve the quality of the resulting categorization.
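A minimal sketch of this kind of approach, assuming the Cavnar-Trenkle out-of-place measure on character n-gram rank profiles as one representative pseudo-distance from language identification; the toy training texts and the profile size are illustrative assumptions.

from collections import Counter

def profile(text, n=3, size=300):
    """Ranked list of the most frequent character n-grams of a text."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(size)]

def out_of_place(p, q):
    """Pseudo-distance: sum of rank displacements of p's n-grams inside q."""
    index = {g: r for r, g in enumerate(q)}
    return sum(abs(r - index.get(g, len(q))) for r, g in enumerate(p))

# toy training set (illustrative)
train = [("the team won the match after extra time", "sport"),
         ("the parliament passed the new budget law", "politics"),
         ("the striker scored twice in the final", "sport")]

def knn_classify(text, k=1):
    """k-NN vote over the training texts using the out-of-place pseudo-distance."""
    dists = sorted((out_of_place(profile(text), profile(t)), label) for t, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

print(knn_classify("the goalkeeper saved a penalty in the match"))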
The focus of this paper is the application of the genetic programming framework to the problem of knowledge discovery in databases, more precisely to the task of classification. Genetic programming possesses certain advantages that make it suitable for data mining, such as the robustness of the algorithm and a structure convenient for rule generation, to name a few. This study concentrates on one type of parallel genetic algorithm, the cellular (diffusion) model. Emphasis is placed on improving the efficiency and scalability of the data mining algorithm, which can be achieved by integrating the algorithm with databases and employing a cellular framework. The cellular model of genetic programming that exploits SQL queries is implemented and applied to the classification task. The results achieved are presented and compared with other machine learning algorithms.
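A minimal sketch of the two ingredients mentioned above: rule fitness evaluated directly through SQL queries over the database, and a cellular (diffusion) layout in which each individual interacts only with its grid neighbours; the toy table, the encoding of a rule as a WHERE clause and the precision-style fitness are assumptions, not the paper's design.

import sqlite3, random

# toy in-memory table standing in for the mined database
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE samples (f1 REAL, f2 REAL, cls TEXT)")
con.executemany("INSERT INTO samples VALUES (?,?,?)",
                [(random.gauss(5, 1), random.gauss(3, 1),
                  random.choice(["a", "b"])) for _ in range(200)])

def fitness(rule, target="a"):
    """Precision of 'IF rule THEN class = target', computed inside the DBMS."""
    covered = con.execute(f"SELECT COUNT(*) FROM samples WHERE {rule}").fetchone()[0]
    correct = con.execute(
        f"SELECT COUNT(*) FROM samples WHERE ({rule}) AND cls = ?", (target,)).fetchone()[0]
    return correct / covered if covered else 0.0

def random_rule():
    col = random.choice(["f1", "f2"])
    return f"{col} {random.choice(['<', '>'])} {random.uniform(1, 7):.2f}"

# Cellular (diffusion) model: individuals live on a torus grid and are replaced
# by the best rule found in their 4-neighbourhood (plus one random newcomer).
SIZE = 4
grid = [[random_rule() for _ in range(SIZE)] for _ in range(SIZE)]
for _ in range(20):
    for i in range(SIZE):
        for j in range(SIZE):
            neigh = [grid[i][j], grid[(i - 1) % SIZE][j], grid[(i + 1) % SIZE][j],
                     grid[i][(j - 1) % SIZE], grid[i][(j + 1) % SIZE], random_rule()]
            grid[i][j] = max(neigh, key=fitness)

best = max((r for row in grid for r in row), key=fitness)
print(best, fitness(best))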
The event runoff coefficient (Rc) and the recession coefficient (tc) are of theoretical importance for understanding catchment response and of practical importance in hydrological design. We analyse 57 event periods between 2013 and 2015 in the 66 ha Austrian Hydrological Open Air Laboratory (HOAL), where the seven subcatchments are stratified by runoff generation type into wetlands, tile drainage and natural drainage. Three machine learning algorithms (Random Forest (RF), Gradient Boosted Decision Tree (GBDT) and Support Vector Machine (SVM)) are used to estimate Rc and tc from 22 event-based explanatory variables representing precipitation, soil moisture, groundwater level and season. The performance of the SVM algorithm in estimating Rc and tc, measured by the coefficient of determination R2, is generally higher than that of the other two methods, and the performance for Rc is higher than that for tc. The relative importance of the explanatory variables, assessed by a heatmap, suggests that Rc of the tile drainage systems is more strongly controlled by the weather conditions than by the catchment state, while the opposite is true for the natural drainage systems. Overall, model performance strongly depends on the runoff generation type.
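A minimal sketch of the estimation set-up, assuming a support vector regression scored by the coefficient of determination R2; the synthetic events and the three example predictors below merely stand in for the 57 HOAL events and the 22 explanatory variables.

import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 57                                           # number of event periods
X = np.column_stack([rng.gamma(2, 10, n),        # event precipitation (mm), synthetic
                     rng.uniform(0.2, 0.4, n),   # antecedent soil moisture (-), synthetic
                     rng.normal(1.5, 0.3, n)])   # groundwater depth (m), synthetic
rc = np.clip(0.02 * X[:, 0] + 1.5 * X[:, 1] - 0.1 * X[:, 2]
             + rng.normal(0, 0.05, n), 0, 1)     # synthetic runoff coefficient

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.01))
r2 = cross_val_score(model, X, rc, cv=5, scoring="r2")
print("cross-validated R2 for Rc:", r2.mean().round(2))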
At the NLP Centre, dividing text into sentences is currently done with a tool based on a rule-based system (a toy splitter of this kind is sketched after the blog list below). In order to obtain enough training data for machine learning, annotators manually split the corpus of contemporary texts CBB.blog (1 million tokens) into sentences. Each file contains one hundredth of the whole corpus, and all data were processed in parallel by two annotators.
The corpus was created from ten contemporary blogs:
hintzu.otaku.cz
modnipeklo.cz
bloc.cz
aleneprokopova.blogspot.com
blog.aktualne.cz
fuchsova.blog.onaidnes.cz
havlik.blog.idnes.cz
blog.aktualne.centrum.cz
klusak.blogspot.cz
myego.cz/welldone
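The following toy splitter (referenced above) illustrates the kind of rule-based sentence segmentation the corpus currently relies on; the single regular expression and the abbreviation list are assumptions for illustration, not the NLP Centre tool itself.

import re

ABBREVIATIONS = {"e.g.", "i.e.", "cf.", "approx."}   # illustrative list only

def split_sentences(text):
    """Rule-based splitting on ., ! or ? followed by whitespace and a capital letter."""
    pieces = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)
    sentences, buffer = [], ""
    for piece in pieces:
        buffer = f"{buffer} {piece}".strip() if buffer else piece
        # do not split right after a known abbreviation
        if not any(buffer.endswith(a) for a in ABBREVIATIONS):
            sentences.append(buffer)
            buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

print(split_sentences("See e.g. The Blog. It was split manually."))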
The purpose of feature selection in machine learning is at least two-fold: saving measurement acquisition costs and reducing the negative effects of the curse of dimensionality, with the aim of improving the accuracy of models and the classification rate of classifiers on previously unseen data. Yet it has been shown recently that the process of feature selection itself can be negatively affected by the very same curse of dimensionality: feature selection methods may easily over-fit or perform unstably. Such an outcome is unlikely to generalize well, and the resulting recognition system may fail to deliver the expected performance. In many tasks, it is therefore crucial to employ additional mechanisms that make the feature selection process more stable and more resistant to the effects of the curse of dimensionality. In this paper we discuss three different approaches to reducing this problem. We present an algorithmic extension applicable to various feature selection methods, capable of reducing excessive dependence of the selected feature subset not only on the specific training data but also on specific properties of the criterion function. Further, we discuss the concept of criterion ensembles, in which various criteria vote on feature inclusion or removal, and we provide a general definition of feature selection hybridization aimed at combining the advantages of dependent and independent criteria. The presented ideas are illustrated with examples, and summarizing recommendations are given.
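A minimal sketch of the criterion-ensemble idea, assuming three common criteria (ANOVA F-score, mutual information, absolute Pearson correlation) that each nominate a top-k subset and then vote on inclusion; the choice of criteria and the synthetic data are illustrative, not the paper's configuration.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif, mutual_info_classif

X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

def top_k(scores, k):
    """Indices of the k highest-scoring features under one criterion."""
    return set(np.argsort(scores)[-k:])

k = 5
criteria = [
    top_k(f_classif(X, y)[0], k),                          # ANOVA F-score
    top_k(mutual_info_classif(X, y, random_state=0), k),   # mutual information
    top_k(np.abs(np.corrcoef(X.T, y)[-1, :-1]), k),        # |Pearson correlation| with the label
]

votes = np.zeros(X.shape[1])
for chosen in criteria:
    for f in chosen:
        votes[f] += 1
selected = np.where(votes >= 2)[0]                         # keep features with a majority vote
print("features selected by the ensemble:", selected)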
Breast cancer survival prediction can have a substantial effect on the selection of the best treatment protocols. Many approaches, such as statistical and machine learning models, have been employed to predict the survival prospects of patients, but newer algorithms such as deep learning can be tested with the aim of improving the models and the prediction accuracy. In this study, we used machine learning and deep learning approaches to predict breast cancer survival in 4,902 patient records from the University of Malaya Medical Centre Breast Cancer Registry. The results indicated that the multilayer perceptron (MLP), random forest (RF) and decision tree (DT) classifiers could predict survivorship with 88.2%, 83.3% and 82.5% accuracy, respectively, on the test samples. The support vector machine (SVM) performed lower, at 80.5%. In this study, tumour size turned out to be the most important feature for breast cancer survivability prediction. Both deep learning and machine learning methods produce satisfactory prediction accuracy, but other factors, such as parameter configuration and data transformations, affect the accuracy of the predictive model.
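A minimal sketch of the model comparison described above, training the same four classifier families (MLP, RF, DT, SVM) and scoring them on a held-out split; the scikit-learn breast-cancer dataset is only a publicly available stand-in for the registry data, and the hyper-parameters are assumptions.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# stand-in dataset; the study uses the UMMC Breast Cancer Registry instead
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "MLP": make_pipeline(StandardScaler(), MLPClassifier(max_iter=1000, random_state=0)),
    "RF":  RandomForestClassifier(n_estimators=200, random_state=0),
    "DT":  DecisionTreeClassifier(random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name}: test accuracy = {model.score(X_te, y_te):.3f}")

# RF feature importances indicate which variables drive the predictions
# (tumour size in the study); here they refer to the stand-in dataset only.
rf = models["RF"]
top = sorted(zip(load_breast_cancer().feature_names, rf.feature_importances_),
             key=lambda p: p[1], reverse=True)[:3]
print("most important features:", top)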
Matrix factorization, or factor analysis, is an important task in the analysis of high-dimensional real-world data. There are several well-known methods and algorithms for factorizing real-valued data, but many application areas, including information retrieval, pattern recognition and data mining, require processing binary rather than real-valued data. Unfortunately, the methods used for real-valued matrix factorization fail in the latter case. In this paper we introduce the background and an initial version of a Genetic Algorithm for binary matrix factorization.
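A minimal sketch of the idea, assuming individuals encode a pair of binary factor matrices (A, B) whose Boolean product approximates the data matrix X and whose fitness is the number of mismatching entries; the population size, mutation rate and toy matrix are illustrative, not the algorithm proposed in the paper.

import numpy as np

rng = np.random.default_rng(0)
k = 3                                            # assumed number of binary factors
A_true = rng.integers(0, 2, (10, k))
B_true = rng.integers(0, 2, (k, 8))
X = (A_true @ B_true > 0).astype(int)            # toy binary data with an exact factorization

def boolean_product(A, B):
    return (A @ B > 0).astype(int)

def fitness(ind):
    """Hamming distance between the Boolean reconstruction and X (lower is better)."""
    A, B = ind
    return np.sum(boolean_product(A, B) != X)

def mutate(ind, rate=0.02):
    """Flip each bit of both factor matrices with a small probability."""
    A, B = ind
    flip = lambda M: np.where(rng.random(M.shape) < rate, 1 - M, M)
    return flip(A), flip(B)

population = [(rng.integers(0, 2, (10, k)), rng.integers(0, 2, (k, 8)))
              for _ in range(60)]
for _ in range(300):
    population.sort(key=fitness)
    parents = population[:30]                    # truncation selection
    population = parents + [mutate(p) for p in parents]

best = min(population, key=fitness)
print("remaining mismatches:", fitness(best))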