Machine Learning Facing the Behavioral Noise Problem in Imbalanced Data Using One Side Behavioral Noise Reduction: Application to Fraud Detection
Abstract: With the expansion of machine learning and data mining in the context of Big Data analytics, class imbalance is a common problem affecting data. It refers to an uneven distribution of instances across classes. This problem arises in many real-world applications such as fraud detection, network intrusion detection, and medical diagnostics. In these cases, instances labeled negative are significantly more numerous than instances labeled positive. When this difference is too large, the learning system may struggle, since most learners are designed for relatively balanced class distributions. Another important problem, which usually accompanies imbalanced data, is the overlap of instances between the two classes, commonly referred to as noise or overlapping data. In this article, we propose an approach called One Side Behavioral Noise Reduction (OSBNR) to deal with class imbalance in the presence of a high noise level. OSBNR proceeds in two steps. First, a cluster analysis groups similar instances of the minority class into several behavior clusters. Second, we select and eliminate the majority-class instances, considered behavioral noise, that overlap with the behavior clusters of the minority class. Experiments carried out on a representative public dataset confirm that the proposed approach is effective for the treatment of class imbalance in the presence of noise.
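The two-step procedure above can be sketched in code. This is a minimal illustration, not the authors' implementation: it assumes k-means is used to form the minority-class behavior clusters, and it flags a majority instance as behavioral noise when it falls within a cluster's radius (the maximum member-to-center distance); both the cluster count `k` and this radius rule are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain k-means with random initialization (NumPy only)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its assigned points
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers, labels

def osbnr_sketch(X_maj, X_min, k=2):
    """One-side noise reduction sketch: cluster the minority class,
    then drop majority instances that overlap a behavior cluster."""
    centers, labels = kmeans(X_min, k)
    # radius of each behavior cluster = farthest member from its center
    radii = np.array([
        np.linalg.norm(X_min[labels == j] - centers[j], axis=1).max()
        if np.any(labels == j) else 0.0
        for j in range(k)
    ])
    # a majority instance is behavioral noise if it lies inside any cluster
    d_maj = np.linalg.norm(X_maj[:, None, :] - centers[None, :, :], axis=2)
    noise_mask = (d_maj <= radii[None, :]).any(axis=1)
    return X_maj[~noise_mask], noise_mask

# Toy example: two tight minority clusters, one overlapping majority point
X_min = np.array([[0.0, 0.0], [0.2, 0.0], [10.0, 10.0], [10.2, 10.0]])
X_maj = np.array([[0.05, 0.0], [5.0, 5.0], [10.1, 10.05]])
kept, mask = osbnr_sketch(X_maj, X_min, k=2)
```

Here the majority points near (0, 0) and (10, 10) overlap the minority clusters and are removed, while the point at (5, 5) is retained, mirroring the one-sided elimination described above.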