Comparison of Imputation Techniques for Efficient Prediction of Software Fault Proneness in Classes
Abstract: Missing data is a persistent problem in almost all
areas of empirical research. The missing data must be treated very
carefully, as data plays a fundamental role in every analysis.
Improper treatment can distort the analysis or generate biased results.
In this paper, we compare and contrast various imputation techniques
on missing data sets and make an empirical evaluation of these
methods so as to construct quality software models. Our empirical
study is based on two public NASA data sets, KC4 and KC1. The
original data sets, of 125 and 2107 cases respectively and without
any missing values, were considered. These data sets were used to create
Missing at Random (MAR) data. Listwise Deletion (LD), Mean
Substitution (MS), Interpolation, Regression with an error term, and
Expectation-Maximization (EM) approaches were then applied to compare
the effects of the various techniques.
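To make the compared techniques concrete, the following is a minimal Python sketch of three of them (LD, MS, and regression with an error term), together with one simple way of generating MAR data. It is an illustration under stated assumptions, not the procedure used in the study: the two-column (x, y) layout, the function names, the median-based missingness rule, and the toy values are all hypothetical, and Interpolation and EM are omitted for brevity.

```python
import random
import statistics


def make_mar(rows, rate, seed=0):
    """Blank out y with a probability that depends only on the observed x.

    Assumption: missingness is driven by whether x lies above its median,
    which is one simple way to obtain Missing at Random (MAR) data.
    """
    rng = random.Random(seed)
    median_x = statistics.median([x for x, _ in rows])
    out = []
    for x, y in rows:
        p = min(1.0, rate * (1.5 if x > median_x else 0.5))
        out.append((x, None if rng.random() < p else y))
    return out


def listwise_deletion(rows):
    """LD: keep only the cases in which y is observed."""
    return [(x, y) for x, y in rows if y is not None]


def mean_substitution(rows):
    """MS: replace each missing y with the mean of the observed y values."""
    mean_y = statistics.fmean([y for _, y in rows if y is not None])
    return [(x, mean_y if y is None else y) for x, y in rows]


def regression_imputation(rows, seed=0):
    """Regression with an error term.

    Fit y = a + b*x on the complete cases by least squares, then impute
    each missing y as the prediction plus Gaussian noise whose standard
    deviation equals that of the fitted residuals.
    """
    rng = random.Random(seed)
    complete = listwise_deletion(rows)
    xs = [x for x, _ in complete]
    ys = [y for _, y in complete]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in complete)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in complete]
    sd = statistics.pstdev(residuals)
    return [
        (x, intercept + slope * x + rng.gauss(0.0, sd) if y is None else y)
        for x, y in rows
    ]


# Toy data (hypothetical values): the second case has a missing y.
toy = [(1.0, 2.0), (2.0, None), (3.0, 6.0), (4.0, 8.0)]
ld_result = listwise_deletion(toy)
ms_result = mean_substitution(toy)
reg_result = regression_imputation(toy)
```

On this toy input, LD discards the incomplete case, MS fills it with the mean of the three observed values, and the regression approach fills it from the fitted line. LD shrinks the sample, MS shrinks the variance of the imputed variable, and the added error term is intended to restore some of that variance.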