Data Mining Classification Methods Applied in Drug Design
Abstract: Data mining comprises a group of statistical
methods used to analyze a set of information, or a data set. It operates
with models and algorithms, which are powerful tools with great
potential. They can help people understand the patterns in a given
body of information, so data mining tools have
a wide range of applications. For example, in theoretical chemistry,
data mining tools can be used to predict molecular properties or
improve computer-assisted drug design. Classification analysis is one
of the major data mining methodologies. The aim of this contribution
is to create a classification model that can handle a
very large data set with high accuracy. For this purpose, logistic regression,
Bayesian logistic regression and random forest models were built
using R software. A Bayesian logistic regression model was also built in
Latent GOLD software. All of these classification methods are
supervised learning methods.
It was necessary to reduce the dimension of the data matrix before constructing
the models, and thus factor analysis (FA) was used. The models
were then applied to predict the biological activity of molecules that are
potential new drug candidates.
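
As a rough illustration of the supervised classification step described above, the following is a minimal Python sketch, not the authors' R or Latent GOLD code. It fits a logistic regression by batch gradient descent and reports held-out accuracy. The data are synthetic and hypothetical: each "molecule" has two made-up scores standing in for factor scores from the FA step, and the optional ridge penalty `l2` is only a simple stand-in for a prior on the coefficients (it corresponds to a Gaussian prior, which is an assumption made here and not necessarily the prior used in the study).

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def linear_score(w, x):
    # w[0] is the intercept; w[1:] are the feature weights.
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))


def fit_logistic(X, y, l2=0.0, lr=0.1, epochs=500):
    """Fit logistic regression weights by batch gradient descent.

    l2 > 0 adds a ridge penalty on the non-intercept weights, i.e. a MAP
    estimate under a zero-mean Gaussian prior (an illustrative assumption).
    """
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            err = sigmoid(linear_score(w, xi)) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        for j in range(p + 1):
            penalty = l2 * w[j] if j > 0 else 0.0
            w[j] -= lr * (grad[j] / n + penalty)
    return w


def predict(w, X):
    # Classify as active (1) when the predicted probability is at least 0.5.
    return [1 if sigmoid(linear_score(w, xi)) >= 0.5 else 0 for xi in X]


def accuracy(y_true, y_pred):
    return sum(a == b for a, b in zip(y_true, y_pred)) / len(y_true)


def make_data(n, rng):
    # Hypothetical data: two scores per molecule; "active" molecules
    # (label 1) are shifted by +2 on both scores.
    X, y = [], []
    for i in range(n):
        label = i % 2
        shift = 2.0 * label
        X.append([rng.gauss(shift, 1.0), rng.gauss(shift, 1.0)])
        y.append(label)
    return X, y


rng = random.Random(0)
X_train, y_train = make_data(200, rng)
X_test, y_test = make_data(100, rng)

w = fit_logistic(X_train, y_train, l2=0.01)
test_accuracy = accuracy(y_test, predict(w, X_test))
print(f"held-out accuracy: {test_accuracy:.2f}")
```

Evaluating on data that were not used for fitting, as done here, is what makes the reported accuracy a fair estimate of how a classifier would perform on new candidate molecules.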