International Science Index


Comparison of Imputation Techniques for Efficient Prediction of Software Fault Proneness in Classes

Abstract: Missing data is a persistent problem in almost all areas of empirical research. Missing data must be treated very carefully, because data plays a fundamental role in every analysis; improper treatment can distort the analysis or generate biased results. In this paper, we compare and contrast various imputation techniques on data sets with missing values and empirically evaluate these methods so as to construct quality software models. Our empirical study is based on two public NASA data sets, KC4 and KC1. The complete data sets, of 125 cases and 2107 cases respectively and without any missing values, were considered. These data sets were used to create Missing at Random (MAR) data. Listwise Deletion (LD), Mean Substitution (MS), Interpolation, Regression with an error term, and Expectation-Maximization (EM) approaches were used to compare the effects of the various techniques.
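To make the compared techniques concrete, the following is a minimal Python sketch of four of them (LD, MS, interpolation, and regression with an error term) together with one simple way of creating MAR data. It is an illustration under stated assumptions, not the procedure used in the study: the column names `loc` and `defects`, the median-based missingness rule, the single-predictor regression, and the Gaussian error term are all hypothetical choices. EM is omitted for brevity.

```python
import random
import statistics


def make_mar(rows, target, driver, rate, seed=0):
    """Blank out `target` with a probability that depends only on the
    fully observed `driver` column (an assumed MAR mechanism)."""
    rng = random.Random(seed)
    median = statistics.median(r[driver] for r in rows)
    out = []
    for r in rows:
        r = dict(r)
        # Rows with a large driver value are three times as likely to lose the target.
        p = min(1.0, rate * (1.5 if r[driver] > median else 0.5))
        if rng.random() < p:
            r[target] = None
        out.append(r)
    return out


def listwise_deletion(rows):
    """LD: keep only the cases that have no missing value."""
    return [dict(r) for r in rows if all(v is not None for v in r.values())]


def mean_substitution(rows, col):
    """MS: replace each missing value with the mean of the observed values."""
    mean = statistics.fmean(r[col] for r in rows if r[col] is not None)
    return [{**r, col: mean if r[col] is None else r[col]} for r in rows]


def linear_interpolation(values):
    """Fill gaps linearly between the nearest observed neighbours;
    gaps at either end take the nearest observed value."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no observed values to interpolate from")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            w = (i - left) / (right - left)
            filled[i] = values[left] + w * (values[right] - values[left])
    return filled


def regression_imputation(rows, target, predictor, seed=0):
    """Regression with an error term: fit target = a + b * predictor on the
    complete cases, then impute the prediction plus Gaussian noise whose
    spread matches the residuals. Needs at least two complete cases with
    differing predictor values."""
    rng = random.Random(seed)
    complete = [r for r in rows if r[target] is not None]
    xs = [r[predictor] for r in complete]
    ys = [r[target] for r in complete]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    sd = statistics.pstdev(y - (intercept + slope * x) for x, y in zip(xs, ys))
    return [
        dict(r) if r[target] is not None
        else {**r, target: intercept + slope * r[predictor] + rng.gauss(0.0, sd)}
        for r in rows
    ]
```

LD shrinks the sample, MS keeps every case but reduces the variance of the imputed column, and the error term in the regression variant is there to restore some of the variability that a purely deterministic prediction would remove.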