International Science Index
Improving Fake News Detection Using K-means and Support Vector Machine Approaches
Fake news and false information are big challenges of all types of media, especially social media. There is a lot of false information, fake likes, views and duplicated accounts as big social networks such as Facebook and Twitter admitted. Most information appearing on social media is doubtful and in some cases misleading. They need to be detected as soon as possible to avoid a negative impact on society. The dimensions of the fake news datasets are growing rapidly, so to obtain a better result of detecting false information with less computation time and complexity, the dimensions need to be reduced. One of the best techniques of reducing data size is using feature selection method. The aim of this technique is to choose a feature subset from the original set to improve the classification performance. In this paper, a feature selection method is proposed with the integration of K-means clustering and Support Vector Machine (SVM) approaches which work in four steps. First, the similarities between all features are calculated. Then, features are divided into several clusters. Next, the final feature set is selected from all clusters, and finally, fake news is classified based on the final feature subset using the SVM method. The proposed method was evaluated by comparing its performance with other state-of-the-art methods on several specific benchmark datasets and the outcome showed a better classification of false information for our work. The detection performance was improved in two aspects. On the one hand, the detection runtime process decreased, and on the other hand, the classification accuracy increased because of the elimination of redundant features and the reduction of datasets dimensions.
A Hybrid Feature Selection and Deep Learning Algorithm for Cancer Disease Classification
Learning from very big datasets is a significant problem for most present data mining and machine learning algorithms. MicroRNA (miRNA) is one of the important big genomic and non-coding datasets presenting the genome sequences. In this paper, a hybrid method for the classification of the miRNA data is proposed. Due to the variety of cancers and high number of genes, analyzing the miRNA dataset has been a challenging problem for researchers. The number of features corresponding to the number of samples is high and the data suffer from being imbalanced. The feature selection method has been used to select features having more ability to distinguish classes and eliminating obscures features. Afterward, a Convolutional Neural Network (CNN) classifier for classification of cancer types is utilized, which employs a Genetic Algorithm to highlight optimized hyper-parameters of CNN. In order to make the process of classification by CNN faster, Graphics Processing Unit (GPU) is recommended for calculating the mathematic equation in a parallel way. The proposed method is tested on a real-world dataset with 8,129 patients, 29 different types of tumors, and 1,046 miRNA biomarkers, taken from The Cancer Genome Atlas (TCGA) database.
From Electroencephalogram to Epileptic Seizures Detection by Using Artificial Neural Networks
Seizure is the main factor that affects the quality of life of epileptic patients. The diagnosis of epilepsy, and hence the identification of epileptogenic zone, is commonly made by using continuous Electroencephalogram (EEG) signal monitoring. Seizure identification on EEG signals is made manually by epileptologists and this process is usually very long and error prone. The aim of this paper is to describe an automated method able to detect seizures in EEG signals, using knowledge discovery in database process and data mining methods and algorithms, which can support physicians during the seizure detection process. Our detection method is based on Artificial Neural Network classifier, trained by applying the multilayer perceptron algorithm, and by using a software application, called Training Builder that has been developed for the massive extraction of features from EEG signals. This tool is able to cover all the data preparation steps ranging from signal processing to data analysis techniques, including the sliding window paradigm, the dimensionality reduction algorithms, information theory, and feature selection measures. The final model shows excellent performances, reaching an accuracy of over 99% during tests on data of a single patient retrieved from a publicly available EEG dataset.
River Stage-Discharge Forecasting Based on Multiple-Gauge Strategy Using EEMD-DWT-LSSVM Approach
This study presented hybrid pre-processing approach along with a conceptual model to enhance the accuracy of river discharge prediction. In order to achieve this goal, Ensemble Empirical Mode Decomposition algorithm (EEMD), Discrete Wavelet Transform (DWT) and Mutual Information (MI) were employed as a hybrid pre-processing approach conjugated to Least Square Support Vector Machine (LSSVM). A conceptual strategy namely multi-station model was developed to forecast the Souris River discharge more accurately. The strategy used herein was capable of covering uncertainties and complexities of river discharge modeling. DWT and EEMD was coupled, and the feature selection was performed for decomposed sub-series using MI to be employed in multi-station model. In the proposed feature selection method, some useless sub-series were omitted to achieve better performance. Results approved efficiency of the proposed DWT-EEMD-MI approach to improve accuracy of multi-station modeling strategies.
Implementation of a Multimodal Biometrics Recognition System with Combined Palm Print and Iris Features
With extensive application, the performance of unimodal biometrics systems has to face a diversity of problems such as signal and background noise, distortion, and environment differences. Therefore, multimodal biometric systems are proposed to solve the above stated problems. This paper introduces a bimodal biometric recognition system based on the extracted features of the human palm print and iris. Palm print biometric is fairly a new evolving technology that is used to identify people by their palm features. The iris is a strong competitor together with face and fingerprints for presence in multimodal recognition systems. In this research, we introduced an algorithm to the combination of the palm and iris-extracted features using a texture-based descriptor, the Scale Invariant Feature Transform (SIFT). Since the feature sets are non-homogeneous as features of different biometric modalities are used, these features will be concatenated to form a single feature vector. Particle swarm optimization (PSO) is used as a feature selection technique to reduce the dimensionality of the feature. The proposed algorithm will be applied to the Institute of Technology of Delhi (IITD) database and its performance will be compared with various iris recognition algorithms found in the literature.
Multi-Level Air Quality Classification in China Using Information Gain and Support Vector Machine
Machine Learning and Data Mining are the two important tools for extracting useful information and knowledge from large datasets. In machine learning, classification is a wildly used technique to predict qualitative variables and is generally preferred over regression from an operational point of view. Due to the enormous increase in air pollution in various countries especially China, Air Quality Classification has become one of the most important topics in air quality research and modelling. This study aims at introducing a hybrid classification model based on information theory and Support Vector Machine (SVM) using the air quality data of four cities in China namely Beijing, Guangzhou, Shanghai and Tianjin from Jan 1, 2014 to April 30, 2016. China's Ministry of Environmental Protection has classified the daily air quality into 6 levels namely Serious Pollution, Severe Pollution, Moderate Pollution, Light Pollution, Good and Excellent based on their respective Air Quality Index (AQI) values. Using the information theory, information gain (IG) is calculated and feature selection is done for both categorical features and continuous numeric features. Then SVM Machine Learning algorithm is implemented on the selected features with cross-validation. The final evaluation reveals that the IG and SVM hybrid model performs better than SVM (alone), Artificial Neural Network (ANN) and K-Nearest Neighbours (KNN) models in terms of accuracy as well as complexity.
A Fuzzy-Rough Feature Selection Based on Binary Shuffled Frog Leaping Algorithm
Feature selection and attribute reduction are crucial
problems, and widely used techniques in the field of machine
learning, data mining and pattern recognition to overcome the
well-known phenomenon of the Curse of Dimensionality. This paper
presents a feature selection method that efficiently carries out attribute
reduction, thereby selecting the most informative features of a dataset.
It consists of two components: 1) a measure for feature subset
evaluation, and 2) a search strategy. For the evaluation measure,
we have employed the fuzzy-rough dependency degree (FRFDD)
of the lower approximation-based fuzzy-rough feature selection
(L-FRFS) due to its effectiveness in feature selection. As for the
search strategy, a modified version of a binary shuffled frog leaping
algorithm is proposed (B-SFLA). The proposed feature selection
method is obtained by hybridizing the B-SFLA with the FRDD. Nine
classifiers have been employed to compare the proposed approach
with several existing methods over twenty two datasets, including
nine high dimensional and large ones, from the UCI repository.
The experimental results demonstrate that the B-SFLA approach
significantly outperforms other metaheuristic methods in terms of the
number of selected features and the classification accuracy.
Hybrid Anomaly Detection Using Decision Tree and Support Vector Machine
Intrusion detection systems (IDS) are the main components of network security. These systems analyze the network events for intrusion detection. The design of an IDS is through the training of normal traffic data or attack. The methods of machine learning are the best ways to design IDSs. In the method presented in this article, the pruning algorithm of C5.0 decision tree is being used to reduce the features of traffic data used and training IDS by the least square vector algorithm (LS-SVM). Then, the remaining features are arranged according to the predictor importance criterion. The least important features are eliminated in the order. The remaining features of this stage, which have created the highest level of accuracy in LS-SVM, are selected as the final features. The features obtained, compared to other similar articles which have examined the selected features in the least squared support vector machine model, are better in the accuracy, true positive rate, and false positive. The results are tested by the UNSW-NB15 dataset.
Optimized Preprocessing for Accurate and Efficient Bioassay Prediction with Machine Learning Algorithms
Bioassay is the measurement of the potency of a chemical substance by its effect on a living animal or plant tissue. Bioassay data and chemical structures from pharmacokinetic and drug metabolism screening are mined from and housed in multiple databases. Bioassay prediction is calculated accordingly to determine further advancement. This paper proposes a four-step preprocessing of datasets for improving the bioassay predictions. The first step is instance selection in which dataset is categorized into training, testing, and validation sets. The second step is discretization that partitions the data in consideration of accuracy vs. precision. The third step is normalization where data are normalized between 0 and 1 for subsequent machine learning processing. The fourth step is feature selection where key chemical properties and attributes are generated. The streamlined results are then analyzed for the prediction of effectiveness by various machine learning algorithms including Pipeline Pilot, R, Weka, and Excel. Experiments and evaluations reveal the effectiveness of various combination of preprocessing steps and machine learning algorithms in more consistent and accurate prediction.
A Psychophysiological Evaluation of an Effective Recognition Technique Using Interactive Dynamic Virtual Environments
Recording psychological and physiological correlates of human performance within virtual environments and interpreting their impacts on human engagement, ‘immersion’ and related emotional or ‘effective’ states is both academically and technologically challenging. By exposing participants to an effective, real-time (game-like) virtual environment, designed and evaluated in an earlier study, a psychophysiological database containing the EEG, GSR and Heart Rate of 30 male and female gamers, exposed to 10 games, was constructed. Some 174 features were subsequently identified and extracted from a number of windows, with 28 different timing lengths (e.g. 2, 3, 5, etc. seconds). After reducing the number of features to 30, using a feature selection technique, K-Nearest Neighbour (KNN) and Support Vector Machine (SVM) methods were subsequently employed for the classification process. The classifiers categorised the psychophysiological database into four effective clusters (defined based on a 3-dimensional space – valence, arousal and dominance) and eight emotion labels (relaxed, content, happy, excited, angry, afraid, sad, and bored). The KNN and SVM classifiers achieved average cross-validation accuracies of 97.01% (±1.3%) and 92.84% (±3.67%), respectively. However, no significant differences were found in the classification process based on effective clusters or emotion labels.
Feature Selection and Predictive Modeling of Housing Data Using Random Forest
Predictive data analysis and modeling involving machine learning techniques become challenging in presence of too many explanatory variables or features. Presence of too many features in machine learning is known to not only cause algorithms to slow down, but they can also lead to decrease in model prediction accuracy. This study involves housing dataset with 79 quantitative and qualitative features that describe various aspects people consider while buying a new house. Boruta algorithm that supports feature selection using a wrapper approach build around random forest is used in this study. This feature selection process leads to 49 confirmed features which are then used for developing predictive random forest models. The study also explores five different data partitioning ratios and their impact on model accuracy are captured using coefficient of determination (r-square) and root mean square error (rsme).
Data Quality Enhancement with String Length Distribution
Recently, collectable manufacturing data are rapidly
increasing. On the other hand, mega recall is getting serious as
a social problem. Under such circumstances, there are increasing
needs for preventing mega recalls by defect analysis such as
root cause analysis and abnormal detection utilizing manufacturing
data. However, the time to classify strings in manufacturing data
by traditional method is too long to meet requirement of quick
defect analysis. Therefore, we present String Length Distribution
Classification method (SLDC) to correctly classify strings in a short
time. This method learns character features, especially string length
distribution from Product ID, Machine ID in BOM and asset list.
By applying the proposal to strings in actual manufacturing data, we
verified that the classification time of strings can be reduced by 80%.
As a result, it can be estimated that the requirement of quick defect
analysis can be fulfilled.
sEMG Interface Design for Locomotion Identification
Surface electromyographic (sEMG) signal has the potential to identify the human activities and intention. This potential is further exploited to control the artificial limbs using the sEMG signal from residual limbs of amputees. The paper deals with the development of multichannel cost efficient sEMG signal interface for research application, along with evaluation of proposed class dependent statistical approach of the feature selection method. The sEMG signal acquisition interface was developed using ADS1298 of Texas Instruments, which is a front-end interface integrated circuit for ECG application. Further, the sEMG signal is recorded from two lower limb muscles for three locomotions namely: Plane Walk (PW), Stair Ascending (SA), Stair Descending (SD). A class dependent statistical approach is proposed for feature selection and also its performance is compared with 12 preexisting feature vectors. To make the study more extensive, performance of five different types of classifiers are compared. The outcome of the current piece of work proves the suitability of the proposed feature selection algorithm for locomotion recognition, as compared to other existing feature vectors. The SVM Classifier is found as the outperformed classifier among compared classifiers with an average recognition accuracy of 97.40%. Feature vector selection emerges as the most dominant factor affecting the classification performance as it holds 51.51% of the total variance in classification accuracy. The results demonstrate the potentials of the developed sEMG signal acquisition interface along with the proposed feature selection algorithm.
CBIR Using Multi-Resolution Transform for Brain Tumour Detection and Stages Identification
Image retrieval is the most interesting technique which is being used today in our digital world. CBIR, commonly expanded as Content Based Image Retrieval is an image processing technique which identifies the relevant images and retrieves them based on the patterns that are extracted from the digital images. In this paper, two research works have been presented using CBIR. The first work provides an automated and interactive approach to the analysis of CBIR techniques. CBIR works on the principle of supervised machine learning which involves feature selection followed by training and testing phase applied on a classifier in order to perform prediction. By using feature extraction, the image transforms such as Contourlet, Ridgelet and Shearlet could be utilized to retrieve the texture features from the images. The features extracted are used to train and build a classifier using the classification algorithms such as Naïve Bayes, K-Nearest Neighbour and Multi-class Support Vector Machine. Further the testing phase involves prediction which predicts the new input image using the trained classifier and label them from one of the four classes namely 1- Normal brain, 2- Benign tumour, 3- Malignant tumour and 4- Severe tumour. The second research work includes developing a tool which is used for tumour stage identification using the best feature extraction and classifier identified from the first work. Finally, the tool will be used to predict tumour stage and provide suggestions based on the stage of tumour identified by the system. This paper presents these two approaches which is a contribution to the medical field for giving better retrieval performance and for tumour stages identification.
A Static Android Malware Detection Based on Actual Used Permissions Combination and API Calls
Android operating system has been recognized by most application developers because of its good open-source and compatibility, which enriches the categories of applications greatly. However, it has become the target of malware attackers due to the lack of strict security supervision mechanisms, which leads to the rapid growth of malware, thus bringing serious safety hazards to users. Therefore, it is critical to detect Android malware effectively. Generally, the permissions declared in the AndroidManifest.xml can reflect the function and behavior of the application to a large extent. Since current Android system has not any restrictions to the number of permissions that an application can request, developers tend to apply more than actually needed permissions in order to ensure the successful running of the application, which results in the abuse of permissions. However, some traditional detection methods only consider the requested permissions and ignore whether it is actually used, which leads to incorrect identification of some malwares. Therefore, a machine learning detection method based on the actually used permissions combination and API calls was put forward in this paper. Meanwhile, several experiments are conducted to evaluate our methodology. The result shows that it can detect unknown malware effectively with higher true positive rate and accuracy while maintaining a low false positive rate. Consequently, the AdaboostM1 (J48) classification algorithm based on information gain feature selection algorithm has the best detection result, which can achieve an accuracy of 99.8%, a true positive rate of 99.6% and a lowest false positive rate of 0.
From Type-I to Type-II Fuzzy System Modeling for Diagnosis of Hepatitis
Hepatitis is one of the most common and dangerous diseases that affects humankind, and exposes millions of people to serious health risks every year. Diagnosis of Hepatitis has always been a challenge for physicians. This paper presents an effective method for diagnosis of hepatitis based on interval Type-II fuzzy. This proposed system includes three steps: pre-processing (feature selection), Type-I and Type-II fuzzy classification, and system evaluation. KNN-FD feature selection is used as the preprocessing step in order to exclude irrelevant features and to improve classification performance and efficiency in generating the classification model. In the fuzzy classification step, an “indirect approach” is used for fuzzy system modeling by implementing the exponential compactness and separation index for determining the number of rules in the fuzzy clustering approach. Therefore, we first proposed a Type-I fuzzy system that had an accuracy of approximately 90.9%. In the proposed system, the process of diagnosis faces vagueness and uncertainty in the final decision. Thus, the imprecise knowledge was managed by using interval Type-II fuzzy logic. The results that were obtained show that interval Type-II fuzzy has the ability to diagnose hepatitis with an average accuracy of 93.94%. The classification accuracy obtained is the highest one reached thus far. The aforementioned rate of accuracy demonstrates that the Type-II fuzzy system has a better performance in comparison to Type-I and indicates a higher capability of Type-II fuzzy system for modeling uncertainty.
Automatic Threshold Search for Heat Map Based Feature Selection: A Cancer Dataset Analysis
Public health is one of the most critical issues today;
therefore, there is great interest to improve technologies in the area
of diseases detection. With machine learning and feature selection,
it has been possible to aid the diagnosis of several diseases such
as cancer. In this work, we present an extension to the Heat Map
Based Feature Selection algorithm, this modification allows automatic
threshold parameter selection that helps to improve the generalization
performance of high dimensional data such as mass spectrometry.
We have performed a comparison analysis using multiple cancer
datasets and compare against the well known Recursive Feature
Elimination algorithm and our original proposal, the results show
improved classification performance that is very competitive against
Multi-Objective Evolutionary Computation Based Feature Selection Applied to Behaviour Assessment of Children
Abstract—Attribute or feature selection is one of the basic
strategies to improve the performances of data classification tasks,
and, at the same time, to reduce the complexity of classifiers,
and it is a particularly fundamental one when the number
of attributes is relatively high. Its application to unsupervised
classification is restricted to a limited number of experiments in
the literature. Evolutionary computation has already proven itself
to be a very effective choice to consistently reduce the number
of attributes towards a better classification rate and a simpler
semantic interpretation of the inferred classifiers. We present a feature
selection wrapper model composed by a multi-objective evolutionary
algorithm, the clustering method Expectation-Maximization (EM),
and the classifier C4.5 for the unsupervised classification of data
extracted from a psychological test named BASC-II (Behavior
Assessment System for Children - II ed.) with two objectives:
Maximizing the likelihood of the clustering model and maximizing
the accuracy of the obtained classifier. We present a methodology
to integrate feature selection for unsupervised classification, model
evaluation, decision making (to choose the most satisfactory model
according to a a posteriori process in a multi-objective context), and
testing. We compare the performance of the classifier obtained by the
multi-objective evolutionary algorithms ENORA and NSGA-II, and
the best solution is then validated by the psychologists that collected
Multiclass Support Vector Machines with Simultaneous Multi-Factors Optimization for Corporate Credit Ratings
Corporate credit rating prediction is one of the most important topics, which has been studied by researchers in the last decade. Over the last decade, researchers are pushing the limit to enhance the exactness of the corporate credit rating prediction model by applying several data-driven tools including statistical and artificial intelligence methods. Among them, multiclass support vector machine (MSVM) has been widely applied due to its good predictability. However, heuristics, for example, parameters of a kernel function, appropriate feature and instance subset, has become the main reason for the critics on MSVM, as they have dictate the MSVM architectural variables. This study presents a hybrid MSVM model that is intended to optimize all the parameter such as feature selection, instance selection, and kernel parameter. Our model adopts genetic algorithm (GA) to simultaneously optimize multiple heterogeneous design factors of MSVM.
Cost Sensitive Feature Selection in Decision-Theoretic Rough Set Models for Customer Churn Prediction: The Case of Telecommunication Sector Customers
In recent days, there is a change and the ongoing development of the telecommunications sector in the global market. In this sector, churn analysis techniques are commonly used for analysing why some customers terminate their service subscriptions prematurely. In addition, customer churn is utmost significant in this sector since it causes to important business loss. Many companies make various researches in order to prevent losses while increasing customer loyalty. Although a large quantity of accumulated data is available in this sector, their usefulness is limited by data quality and relevance. In this paper, a cost-sensitive feature selection framework is developed aiming to obtain the feature reducts to predict customer churn. The framework is a cost based optional pre-processing stage to remove redundant features for churn management. In addition, this cost-based feature selection algorithm is applied in a telecommunication company in Turkey and the results obtained with this algorithm.
A Hybrid Gene Selection Technique Using Improved Mutual Information and Fisher Score for Cancer Classification Using Microarrays
Feature Selection is significant in order to perform constructive classification in the area of cancer diagnosis. However, a large number of features compared to the number of samples makes the task of classification computationally very hard and prone to errors in microarray gene expression datasets. In this paper, we present an innovative method for selecting highly informative gene subsets of gene expression data that effectively classifies the cancer data into tumorous and non-tumorous. The hybrid gene selection technique comprises of combined Mutual Information and Fisher score to select informative genes. The gene selection is validated by classification using Support Vector Machine (SVM) which is a supervised learning algorithm capable of solving complex classification problems. The results obtained from improved Mutual Information and F-Score with SVM as a classifier has produced efficient results.
Automatic Staging and Subtype Determination for Non-Small Cell Lung Carcinoma Using PET Image Texture Analysis
In this study, our goal was to perform tumor staging and subtype determination automatically using different texture analysis approaches for a very common cancer type, i.e., non-small cell lung carcinoma (NSCLC). Especially, we introduced a texture analysis approach, called Law’s texture filter, to be used in this context for the first time. The 18F-FDG PET images of 42 patients with NSCLC were evaluated. The number of patients for each tumor stage, i.e., I-II, III or IV, was 14. The patients had ~45% adenocarcinoma (ADC) and ~55% squamous cell carcinoma (SqCCs). MATLAB technical computing language was employed in the extraction of 51 features by using first order statistics (FOS), gray-level co-occurrence matrix (GLCM), gray-level run-length matrix (GLRLM), and Laws’ texture filters. The feature selection method employed was the sequential forward selection (SFS). Selected textural features were used in the automatic classification by k-nearest neighbors (k-NN) and support vector machines (SVM). In the automatic classification of tumor stage, the accuracy was approximately 59.5% with k-NN classifier (k=3) and 69% with SVM (with one versus one paradigm), using 5 features. In the automatic classification of tumor subtype, the accuracy was around 92.7% with SVM one vs. one. Texture analysis of FDG-PET images might be used, in addition to metabolic parameters as an objective tool to assess tumor histopathological characteristics and in automatic classification of tumor stage and subtype.
One-Class Support Vector Machine for Sentiment Analysis of Movie Review Documents
Sentiment analysis means to classify a given review
document into positive or negative polar document. Sentiment
analysis research has been increased tremendously in recent times
due to its large number of applications in the industry and academia.
Sentiment analysis models can be used to determine the opinion of
the user towards any entity or product. E-commerce companies can
use sentiment analysis model to improve their products on the basis
of users’ opinion. In this paper, we propose a new One-class Support
Vector Machine (One-class SVM) based sentiment analysis model
for movie review documents. In the proposed approach, we initially
extract features from one class of documents, and further test the
given documents with the one-class SVM model if a given new test
document lies in the model or it is an outlier. Experimental results
show the effectiveness of the proposed sentiment analysis model.
Fuzzy Population-Based Meta-Heuristic Approaches for Attribute Reduction in Rough Set Theory
One of the global combinatorial optimization
problems in machine learning is feature selection. It concerned with
removing the irrelevant, noisy, and redundant data, along with
keeping the original meaning of the original data. Attribute reduction
in rough set theory is an important feature selection method. Since
attribute reduction is an NP-hard problem, it is necessary to
investigate fast and effective approximate algorithms. In this paper,
we proposed two feature selection mechanisms based on memetic
algorithms (MAs) which combine the genetic algorithm with a fuzzy
record to record travel algorithm and a fuzzy controlled great deluge
algorithm, to identify a good balance between local search and
genetic search. In order to verify the proposed approaches, numerical
experiments are carried out on thirteen datasets. The results show that
the MAs approaches are efficient in solving attribute reduction
problems when compared with other meta-heuristic approaches.
Calcification Classification in Mammograms Using Decision Trees
Cancer affects people globally with breast cancer being a leading killer. Breast cancer is due to the uncontrollable multiplication of cells resulting in a tumour or neoplasm. Tumours are called ‘benign’ when cancerous cells do not ravage other body tissues and ‘malignant’ if they do so. As mammography is an effective breast cancer detection tool at an early stage which is the most treatable stage it is the primary imaging modality for screening and diagnosis of this cancer type. This paper presents an automatic mammogram classification technique using wavelet and Gabor filter. Correlation feature selection is used to reduce the feature set and selected features are classified using different decision trees.
A New DIDS Design Based on a Combination Feature Selection Approach
Feature selection has been used in many fields such as
classification, data mining and object recognition and proven to be
effective for removing irrelevant and redundant features from the
original dataset. In this paper, a new design of distributed intrusion
detection system using a combination feature selection model based
on bees and decision tree. Bees algorithm is used as the search
strategy to find the optimal subset of features, whereas decision tree
is used as a judgment for the selected features. Both the produced
features and the generated rules are used by Decision Making Mobile
Agent to decide whether there is an attack or not in the networks.
Decision Making Mobile Agent will migrate through the networks,
moving from node to another, if it found that there is an attack on one
of the nodes, it then alerts the user through User Interface Agent or
takes some action through Action Mobile Agent. The KDD Cup 99
dataset is used to test the effectiveness of the proposed system. The
results show that even if only four features are used, the proposed
system gives a better performance when it is compared with the
obtained results using all 41 features.
A New Internal Architecture Based on Feature Selection for Holonic Manufacturing System
This paper suggests a new internal architecture of
holon based on feature selection model using the combination of
Bees Algorithm (BA) and Artificial Neural Network (ANN). BA is
used to generate features while ANN is used as a classifier to
evaluate the produced features. Proposed system is applied on the
Wine dataset, the statistical result proves that the proposed system is
effective and has the ability to choose informative features with high
Classification of Political Affiliations by Reduced Number of Features
By the evolvement in technology, the way of
expressing opinions switched direction to the digital world. The
domain of politics, as one of the hottest topics of opinion mining
research, merged together with the behavior analysis for affiliation
determination in texts, which constitutes the subject of this paper.
This study aims to classify the text in news/blogs either as
Republican or Democrat with the minimum number of features. As
an initial set, 68 features which 64 were constituted by Linguistic
Inquiry and Word Count (LIWC) features were tested against 14
benchmark classification algorithms. In the later experiments, the
dimensions of the feature vector reduced based on the 7 feature
selection algorithms. The results show that the “Decision Tree”,
“Rule Induction” and “M5 Rule” classifiers when used with “SVM”
and “IGR” feature selection algorithms performed the best up to
82.5% accuracy on a given dataset. Further tests on a single feature
and the linguistic based feature sets showed the similar results. The
feature “Function”, as an aggregate feature of the linguistic category,
was found as the most differentiating feature among the 68 features
with the accuracy of 81% in classifying articles either as Republican
Multi-Layer Perceptron Neural Network Classifier with Binary Particle Swarm Optimization Based Feature Selection for Brain-Computer Interfaces
Brain-Computer Interfaces (BCIs) measure brain
signals activity, intentionally and unintentionally induced by users,
and provides a communication channel without depending on the
brain’s normal peripheral nerves and muscles output pathway.
Feature Selection (FS) is a global optimization machine learning
problem that reduces features, removes irrelevant and noisy data
resulting in acceptable recognition accuracy. It is a vital step
affecting pattern recognition system performance. This study presents
a new Binary Particle Swarm Optimization (BPSO) based feature
selection algorithm. Multi-layer Perceptron Neural Network
(MLPNN) classifier with backpropagation training algorithm and
Levenberg-Marquardt training algorithm classify selected features.
A Comparative Study of Malware Detection Techniques Using Machine Learning Methods
In the past few years, the amount of malicious software
increased exponentially and, therefore, machine learning algorithms
became instrumental in identifying clean and malware files through
(semi)-automated classification. When working with very large
datasets, the major challenge is to reach both a very high malware
detection rate and a very low false positive rate. Another challenge
is to minimize the time needed for the machine learning algorithm to
do so. This paper presents a comparative study between different
machine learning techniques such as linear classifiers, ensembles,
decision trees or various hybrids thereof. The training dataset consists
of approximately 2 million clean files and 200.000 infected files,
which is a realistic quantitative mixture. The paper investigates the
above mentioned methods with respect to both their performance
(detection rate and false positive rate) and their practicability.