International Science Index


10011899

Machine Learning Facing the Behavioral Noise Problem in Imbalanced Data Using One-Side Behavioral Noise Reduction: Application to Fraud Detection

Abstract: With the expansion of machine learning and data mining in the context of Big Data analytics, a common problem that affects data is class imbalance. It refers to an uneven distribution of instances across classes and arises in many real-world applications such as fraud detection, network intrusion detection, and medical diagnostics. In these cases, negatively labeled instances are significantly more numerous than positively labeled ones. When this difference is too large, the learning system may struggle, since most learners are designed for relatively balanced class distributions. Another important problem, which usually accompanies imbalanced data, is the overlap of instances between the two classes, commonly referred to as noise or overlapping data. In this article, we propose an approach called One-Side Behavioral Noise Reduction (OSBNR), which addresses class imbalance in the presence of a high noise level. OSBNR proceeds in two steps. First, a cluster analysis is applied to group similar instances of the minority class into several behavior clusters. Second, we select and eliminate the instances of the majority class, considered as behavioral noise, that overlap with the behavior clusters of the minority class. The results of experiments carried out on a representative public dataset confirm that the proposed approach is effective for handling class imbalance in the presence of noise.
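The two steps described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract only specifies "cluster analysis" and one-sided removal of overlapping majority instances, so plain k-means and a cluster-radius overlap test are assumptions made here for concreteness.

```python
import numpy as np

def osbnr_sketch(X_maj, X_min, n_clusters=1, seed=0, max_iter=50):
    """Hedged sketch of One-Side Behavioral Noise Reduction (OSBNR).

    Step 1: group minority instances into behavior clusters
    (plain k-means here, an assumed choice of cluster analysis).
    Step 2: remove majority instances falling within a cluster's
    radius (max member-to-centroid distance), treating them as
    behavioral noise. The minority class is left untouched.
    """
    rng = np.random.default_rng(seed)
    # --- Step 1: simple k-means over the minority class ---
    centroids = X_min[rng.choice(len(X_min), n_clusters, replace=False)]
    for _ in range(max_iter):
        d = np.linalg.norm(X_min[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X_min[labels == k].mean(axis=0)
                        if np.any(labels == k) else centroids[k]
                        for k in range(n_clusters)])
        if np.allclose(new, centroids):
            break
        centroids = new
    # radius of each behavior cluster
    radii = np.array([np.linalg.norm(X_min[labels == k] - centroids[k],
                                     axis=1).max()
                      if np.any(labels == k) else 0.0
                      for k in range(n_clusters)])
    # --- Step 2: one-sided removal of overlapping majority instances ---
    d_maj = np.linalg.norm(X_maj[:, None] - centroids[None], axis=2)
    noisy = (d_maj <= radii).any(axis=1)
    return X_maj[~noisy]  # cleaned majority class
```

The result is an undersampled majority class in which only the instances overlapping minority behavior clusters are discarded, which is the one-sided character the method's name refers to.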