A key issue in stock investment is how to select representative features for stock selection. The first objective of this paper is to determine whether an automated stock investment system using machine learning techniques can identify a portfolio of growth stocks that is highly likely to provide returns better than the stock market index. The second objective is to identify the technical features that best characterize whether a stock’s price is likely to go up, and to identify the most important factors and their contribution to predicting that likelihood. First, unsupervised machine learning techniques, such as cluster analysis, were applied to the stock data to identify a cluster of stocks that was likely to go up in price (portfolio 1). Next, principal component analysis was used to select stocks that rated high on component one and component two (portfolio 2). Thirdly, a supervised machine learning technique, logistic regression, was used to select stocks with a high probability of their price going up (portfolio 3). The predictive models were validated with metrics such as sensitivity (recall), specificity and overall accuracy; all accuracy measures were above 70%. The top three stocks were selected for each of the three stock portfolios and traded in the market for one month. After one month, the return for each stock portfolio was computed and compared with the stock market index return. The returns were 23.87% for the principal component analysis portfolio, 11.65% for the logistic regression portfolio and 8.88% for the K-means cluster portfolio, while the stock market index returned 0.38%. All portfolios thus outperformed the market by more than eight times. This study confirms that an automated stock investment system using machine learning techniques can identify top performing stock portfolios that outperform the stock market.
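The validation metrics named in the abstract above (sensitivity, specificity and overall accuracy) can be sketched from binary predictions as follows. This is a minimal illustration, not the authors' code: the function names and the label lists are made up, and a label of 1 is assumed to mean "price went up".

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = price went up)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def validation_metrics(y_true, y_pred):
    """Return sensitivity (recall), specificity and overall accuracy."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / len(y_true) if y_true else 0.0,
    }


# Made-up labels for illustration only (not data from the paper).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = validation_metrics(y_true, y_pred)
print(metrics)
```

Sensitivity measures how many rising stocks the model catches, while specificity measures how many non-rising stocks it correctly rejects; reporting both guards against a model that simply predicts "up" for everything.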
This paper presents a Machine Learning (ML) approach to support Meningitis diagnosis in patients at a children’s hospital in Sao Paulo, Brazil. The aim is to use ML techniques to reduce the use of invasive procedures, such as cerebrospinal fluid (CSF) collection, as much as possible. In this study, we focus on predicting the probability of Meningitis given the results of blood and urine laboratory tests, together with the analysis of pain or other complaints from the patient. We tested a number of different ML algorithms, including Adaptive Boosting (AdaBoost), Decision Tree, Gradient Boosting, K-Nearest Neighbors (KNN), Logistic Regression, Random Forest and Support Vector Machines (SVM). The Decision Tree algorithm performed best, with 94.56% and 96.18% accuracy on the training and testing data, respectively. These results represent a significant aid to doctors in diagnosing Meningitis as early as possible and in avoiding expensive and painful procedures on some children.
2016 has become the year of the Artificial Intelligence explosion. AI technologies are maturing to the point that most of the world's well-known tech giants are making large investments to increase their AI capabilities. Machine learning is the science of getting computers to act without being explicitly programmed, and deep learning is a subset of machine learning that uses deep neural networks to train a machine to learn features directly from data. Deep learning enables many machine learning applications that expand the field of AI. At present, deep learning frameworks are widely deployed on servers for deep learning applications in both academia and industry. Training deep neural networks involves many standard processes and algorithms, but performance can differ from one framework to another. In this paper we evaluate the running performance of two state-of-the-art distributed deep learning frameworks that run training computations in parallel over multiple GPUs and multiple nodes in our cloud environment. We evaluate the training performance of the frameworks with the ResNet-50 convolutional neural network, and we analyze which factors account for the performance differences between the two distributed frameworks. Through the experimental analysis, we identify overheads that could be further optimized. The main contribution is that the evaluation results provide further optimization directions in both performance tuning and algorithmic design.
Attention-Deficit/Hyperactivity Disorder (ADHD), epilepsy, and autism affect millions of children worldwide, many of whom are undiagnosed despite the fact that all of these disorders are detectable in early childhood. Late diagnosis can cause severe problems because treatment is delayed and because of widespread misconceptions and lack of awareness about these disorders. Electroencephalography (EEG) has played a vital role in the assessment of neural function in children. Therefore, quantitative EEG measurement will be used as a tool in the evaluation of patients who may have ADHD, epilepsy, or autism. We propose a screening tool that uses EEG signals and machine learning algorithms to detect these disorders at an early age in an automated manner. As a first step, the proposed classifiers were applied to epilepsy, where they achieved an accuracy of approximately 97% using SVM, Naïve Bayes and Decision Tree, and 98% using KNN, which gives hope for the work yet to be conducted.
Cognitive decline and frailty are apparent in older adults, leading to an increased risk of falling. Currently, health care professionals have to make professional judgements regarding such risks, and hence difficult decisions regarding the future welfare of the ageing population. This study uses health data from The Irish Longitudinal Study on Ageing (TILDA), focusing on adults over the age of 50 years, in order to analyse health risk factors and predict the likelihood of falls. The prediction is based on machine learning algorithms that take health risk factors as inputs and output the likelihood of falling. Initial results show that health risk factors such as long-term health issues contribute to the number of falls. The identification of such health risk factors has the potential to inform health and social care professionals, older people and their family members in order to mitigate daily living risks.
Accurate prediction of NOx emission is a continuous challenge in the field of diesel engine-out emission modeling. Performing experiments for each condition and scenario costs a significant amount of money and man-hours; therefore, a model-based development strategy has been implemented to address that issue. NOx formation is highly dependent on the burned gas temperature and the O2 concentration inside the cylinder. Current empirical models are developed by calibrating the parameters representing the engine operating conditions with respect to the measured NOx. This limits the prediction of purely empirical models to the region where they have been calibrated. An alternative solution is presented in this paper, which focuses on the use of in-cylinder combustion parameters to form a predictive semi-empirical NOx model. The result of this work is a fast and predictive NOx model that combines physical parameters with empirical correlations. The model is developed from steady-state data collected over the entire operating region of the engine and from a predictive combustion model, which is developed in Gamma Technology (GT)-Power using the Direct Injected (DI)-Pulse combustion object. In this approach, the temperature in both the burned and unburned zones is considered during the combustion period, i.e. from Intake Valve Closing (IVC) to Exhaust Valve Opening (EVO). The oxygen concentration consumed in the burned zone and the trapped fuel mass are also considered while developing the reported model. Several statistical methods are used to construct the model, including individual machine learning methods and ensemble machine learning methods. A detailed validation of the model on multiple diesel engines is reported in this work. A substantial number of cases are tested for different engine configurations over a large span of speed and load points.
Different sweeps of operating conditions such as Exhaust Gas Recirculation (EGR), injection timing and Variable Valve Timing (VVT) are also considered for the validation. The model shows very good predictability and robustness at both sea level and altitude under different ambient conditions. Its advantages, such as high accuracy and robustness at different operating conditions, low computational time and the smaller number of data points required for calibration, establish a platform on which the model-based approach can be used for the engine calibration and development process. Moreover, this work aims to establish a framework for future model development for other targets such as soot, Combustion Noise Level (CNL), NO2/NOx ratio, etc.
With the continuing increase in smart meter installations across the globe, the need to process the resulting load data is evident. Clustering-based load profiling uses unsupervised machine learning tools to formulate typical load curves, or load profiles. The most commonly used algorithm in the load profiling literature is K-means. While the algorithm has been successfully tested in a variety of applications, its drawback is its strong dependence on the initialization phase. This paper proposes a novel modified form of K-means that addresses this problem. Simulation results indicate the superiority of the proposed algorithm over K-means.
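The initialization dependence mentioned in the abstract above is commonly mitigated by choosing well-separated starting centroids. The sketch below shows k-means++ style seeding, a widely known technique; it is not the modification proposed in the paper, which the abstract does not describe. The load profiles are made-up four-point curves, and the function names are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length load profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_seeds(profiles, k, rng):
    """Pick k initial centroids, favouring profiles far from those already chosen.

    Assumes the data contains at least k distinct profiles; otherwise all
    sampling weights become zero and random.choices raises ValueError.
    """
    seeds = [rng.choice(profiles)]
    while len(seeds) < k:
        # Weight each profile by its squared distance to the nearest seed so far.
        weights = [min(squared_distance(p, s) for s in seeds) for p in profiles]
        seeds.append(rng.choices(profiles, weights=weights, k=1)[0])
    return seeds


# Made-up daily load profiles (four readings each), for illustration only.
profiles = [
    (0.2, 0.3, 0.9, 0.8),
    (0.2, 0.4, 1.0, 0.7),
    (0.9, 0.8, 0.3, 0.2),
    (1.0, 0.7, 0.2, 0.3),
    (0.5, 0.5, 0.5, 0.5),
    (0.6, 0.5, 0.4, 0.5),
]
seeds = kmeans_pp_seeds(profiles, k=3, rng=random.Random(42))
print(seeds)
```

Because an already chosen profile has weight zero, it cannot be drawn again, so the returned seeds are always distinct data points.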
The Greek Energy Market is structured as a mandatory pool in which producers submit their bid offers on a day-ahead basis. The System Operator solves an optimization routine aimed at minimizing the cost of the electricity produced. The solution of the optimization problem yields the System Marginal Price (SMP). Accurate forecasts of the SMP can lead to increased profits and more efficient portfolio management from the producer's perspective. The aim of this study is to provide a comparative analysis of various machine learning models, such as artificial neural networks and neuro-fuzzy models, for the prediction of the SMP of the Greek market. Machine learning algorithms are favored in prediction problems since they can capture and simulate the volatility of complex time series.
A major challenge in medical studies, especially longitudinal ones, is the problem of missing measurements, which hinders the effective application of many machine learning algorithms. Furthermore, recent Alzheimer's Disease (AD) studies have focused on the delineation of Early Mild Cognitive Impairment (EMCI) and Late Mild Cognitive Impairment (LMCI) from cognitively normal controls (CN), which is essential for developing effective and early treatment methods. To address these challenges, this paper explores the potential of the eXtreme Gradient Boosting (XGBoost) algorithm for handling missing values in multiclass classification. We seek a generalized classification scheme in which all prodromal stages of the disease are considered simultaneously in the classification and decision-making processes. Given the large number of subjects (1631) included in this study and the presence of almost 28% missing values, we investigated the performance of XGBoost on the classification of the four classes AD, CN, EMCI, and LMCI. Using 10-fold cross-validation, XGBoost is shown to outperform other state-of-the-art classification algorithms by 3% in terms of accuracy and F-score. Our model achieved an accuracy of 80.52%, a precision of 80.62% and a recall of 80.51%, supporting the more natural and promising multiclass classification.
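The 10-fold cross-validation mentioned in the abstract above can be sketched as an index split over the 1631 subjects. This is an illustrative assumption about the mechanics only: the abstract does not say whether the folds were stratified by class or which library produced them, and the function name and seed are hypothetical.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, test_idx


# 1631 subjects, as reported in the abstract; plain (unstratified) folds assumed.
splits = list(k_fold_indices(1631, k=10))
fold_sizes = [len(test) for _, test in splits]
print(fold_sizes)
```

Each subject appears in exactly one test fold, so the reported accuracy is an average over predictions made on data the model did not see during training.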
Gradient boosting methods have proven to be a very important strategy. Many successful machine learning solutions have been developed using XGBoost and its derivatives. The aim of this study is to investigate and compare the efficiency of three gradient boosting methods. The Home Credit dataset, which contains 219 features and 356251 records, is used in this work. In addition, new features are generated and several techniques are used to rank and select the best features. The implementation indicates that LightGBM is faster and more accurate than CatBoost and XGBoost across varying numbers of features and records.
Machine Learning and Data Mining are two important tools for extracting useful information and knowledge from large datasets. In machine learning, classification is a widely used technique to predict qualitative variables and is generally preferred over regression from an operational point of view. Due to the enormous increase in air pollution in various countries, especially China, Air Quality Classification has become one of the most important topics in air quality research and modelling. This study introduces a hybrid classification model based on information theory and Support Vector Machine (SVM), using the air quality data of four cities in China, namely Beijing, Guangzhou, Shanghai and Tianjin, from Jan 1, 2014 to April 30, 2016. China's Ministry of Environmental Protection classifies daily air quality into 6 levels, namely Serious Pollution, Severe Pollution, Moderate Pollution, Light Pollution, Good and Excellent, based on their respective Air Quality Index (AQI) values. Using information theory, information gain (IG) is calculated and feature selection is performed for both categorical features and continuous numeric features. The SVM algorithm is then applied to the selected features with cross-validation. The final evaluation reveals that the IG and SVM hybrid model performs better than SVM alone, Artificial Neural Network (ANN) and K-Nearest Neighbours (KNN) models in terms of accuracy as well as complexity.
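The information gain used for feature selection in the abstract above is the reduction in class entropy obtained by splitting on a feature. The sketch below computes it for categorical features; the feature names and values are made up, and continuous features would first need to be discretized, a step whose details the abstract does not specify.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their expected entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Made-up observations for illustration only.
air_quality = ["Good", "Good", "Light Pollution", "Light Pollution"]
wind = ["high", "high", "low", "low"]          # perfectly separates the classes
weekday = ["mon", "tue", "mon", "tue"]         # carries no class information

ig_wind = information_gain(wind, air_quality)
ig_weekday = information_gain(weekday, air_quality)
print(ig_wind, ig_weekday)
```

Features are then ranked by information gain and only the highest-ranked ones are passed to the classifier, which is what reduces model complexity.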
Online marketplaces are not only digital places where consumers buy and sell merchandise but also destinations where brands connect with real consumers at the moment when those consumers are in a shopping mindset. For many marketplaces, brands have been important partners through advertising. There is, however, a risk that advertising disrupts a consumer’s shopping journey if it hurts the user experience or takes the user away from the site. Both could lead to a loss of transaction revenue for the marketplace. In this paper, we present user-based methods for cannibalization control that selectively turn off ads for users who are likely to be cannibalized by ads, subject to business objectives. We present ways of measuring cannibalization of advertising in the context of an online marketplace and propose novel ways of measuring cannibalization through purchase propensity and uplift modeling. A/B testing has shown that our methods can significantly improve user purchase and engagement metrics while operating within business objectives. To our knowledge, this is the first paper that addresses cannibalization mitigation at the user level in the context of advertising.
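The idea behind uplift-based cannibalization control in the abstract above can be illustrated with the simplest possible estimate: the difference in purchase rate between users who saw ads and users who did not. This is only a sketch under stated assumptions; the paper's actual propensity and uplift models are not described in the abstract, and the segment names, outcomes, and decision rule here are hypothetical.

```python
def purchase_rate(outcomes):
    """Fraction of users who purchased (1 = purchased, 0 = did not)."""
    return sum(outcomes) / len(outcomes)


def ad_uplift(with_ads, without_ads):
    """Purchase rate with ads shown minus purchase rate with ads hidden."""
    return purchase_rate(with_ads) - purchase_rate(without_ads)


# Made-up A/B outcomes per user segment, for illustration only.
segments = {
    "high_intent": {
        "with_ads": [1, 0, 0, 0, 1, 0, 0, 0, 0, 0],
        "without_ads": [1, 1, 0, 1, 0, 0, 1, 0, 0, 0],
    },
    "browsing": {
        "with_ads": [1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
        "without_ads": [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    },
}

uplift = {
    name: ad_uplift(data["with_ads"], data["without_ads"])
    for name, data in segments.items()
}
# Hypothetical rule: hide ads where they appear to reduce purchases.
ads_off = [name for name, value in uplift.items() if value < 0]
print(uplift, ads_off)
```

A negative uplift suggests that ads are drawing purchases away from the marketplace for that segment, which is the cannibalization effect the paper aims to control.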
Network security engineers work to keep services available at all times by handling intruder attacks. An Intrusion Detection System (IDS) is one of the available mechanisms used to sense and classify abnormal actions. The IDS must therefore always be up to date with the latest attack signatures to preserve the confidentiality, integrity, and availability of the services. The speed of the IDS is a very important issue, as is its ability to learn new attacks. This research work illustrates how the Knowledge Discovery and Data Mining (or Knowledge Discovery in Databases, KDD) dataset can be used for testing and evaluating different Machine Learning techniques. It focuses mainly on the KDD preprocessing stage in order to prepare a sound and fair experimental data set. The J48, MLP, and Bayes Network classifiers were chosen for this study. The results show that the J48 classifier achieved the highest accuracy rate for detecting and classifying all KDD dataset attacks, which are of type DOS, R2L, U2R, and PROBE.
This paper critically examines the use of Machine Learning procedures in curbing unauthorized access to valuable areas of an organization. Passwords, PIN codes and user identification have in recent times been only partially successful in curbing identity-related crimes, hence the need for a system that incorporates biometric characteristics such as DNA and pattern recognition of variations in facial expressions. The facial model is built on the OpenCV library and is based on certain physiological features. A Raspberry Pi 3 module is used to compile the OpenCV library, which extracts detected faces captured by the camera and stores them in the datasets directory. The model is trained with 50 epoch runs over the database, and faces are recognized by the Local Binary Pattern Histogram (LBPH) recognizer contained in OpenCV. The training algorithm used by the neural network is back propagation, coded in Python, with 200 epoch runs to identify specific resemblance in the exclusive OR (XOR) output neurons. The research confirmed that physiological parameters are more effective measures for curbing identity-related crimes.
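The Local Binary Pattern Histogram recognizer named in the abstract above is built on local binary pattern codes. The sketch below computes the basic 3x3 variant in pure Python as an illustration; OpenCV's recognizer uses a configurable radius, neighbour count and grid of cells, so this is not a reproduction of that implementation. The pixel values are made up.

```python
def lbp_code(image, row, col):
    """8-bit local binary pattern for the pixel at (row, col).

    Each of the 8 neighbours, read clockwise from the top-left, contributes
    a 1 bit if it is at least as bright as the centre pixel.
    """
    center = image[row][col]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for dr, dc in offsets:
        bit = 1 if image[row + dr][col + dc] >= center else 0
        code = (code << 1) | bit
    return code


def lbp_histogram(image):
    """Histogram of LBP codes over all interior pixels of a grayscale image."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[lbp_code(image, r, c)] += 1
    return hist


# Made-up 3x3 grayscale patch; only the centre pixel has a full neighbourhood.
patch = [
    [5, 9, 1],
    [4, 6, 7],
    [2, 3, 8],
]
code = lbp_code(patch, 1, 1)
hist = lbp_histogram(patch)
print(code)
```

Faces are then compared by the distance between their histograms, which makes the descriptor tolerant of uniform changes in brightness.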
This paper presents a method for improving object search accuracy using a deep learning model. A major obstacle to accurate similarity measurement with deep learning is the huge amount of data required for training pairwise similarity scores (metrics), which is impractical to collect. Thus, similarity scores are usually trained with a relatively small dataset that comes from a different domain, which limits the accuracy of the similarity measure. For this reason, this paper proposes a deep learning model that can be trained with a significantly smaller amount of data: clustered data in which each cluster contains a set of visually similar images. To measure similarity distance with the proposed method, visual features of two images are extracted from intermediate layers of a convolutional neural network with various pooling methods, and the network is trained with pairwise similarity scores, which are defined as zero for images in the same cluster. The proposed method outperforms state-of-the-art object similarity scoring techniques in an evaluation on finding exact items: it achieves 86.5% accuracy, compared with 59.9% for the state-of-the-art technique. That is, an exact item can be found among four retrieved images with an accuracy of 86.5%, and the remaining retrieved images may still be similar products. Therefore, the proposed method can reduce the amount of training data by an order of magnitude while providing a reliable similarity metric.
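The evaluation described in the abstract above, finding the exact item among four retrieved images, corresponds to a top-k hit rate with k = 4. The sketch below shows how such a figure can be computed; the item identifiers and rankings are made up, and the function name is hypothetical.

```python
def top_k_accuracy(ranked_results, targets, k=4):
    """Fraction of queries whose exact target appears among the first k results."""
    hits = sum(
        1 for ranked, target in zip(ranked_results, targets) if target in ranked[:k]
    )
    return hits / len(targets)


# Made-up retrieval results (most similar first), for illustration only.
ranked_results = [
    ["shoe_17", "shoe_03", "shoe_44", "shoe_09", "shoe_21"],
    ["bag_02", "bag_31", "bag_08", "bag_11", "bag_05"],
    ["hat_12", "hat_07", "hat_01", "hat_30", "hat_19"],
]
targets = ["shoe_44", "bag_05", "hat_12"]

accuracy_at_4 = top_k_accuracy(ranked_results, targets, k=4)
print(accuracy_at_4)
```

In this toy example the second query misses because its target is ranked fifth, just outside the four retrieved images.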
Hand gesture recognition is a technique used to locate, detect, and recognize a hand gesture. Detection and recognition are concepts of Artificial Intelligence (AI), and AI concepts are applicable in Human Computer Interaction (HCI), Expert Systems (ES), etc. Hand gesture recognition can be used in sign language interpretation. Sign language is a visual communication tool used mostly by deaf communities and people with speech disorders. Communication barriers exist when people with speech disorders interact with others. This research aims to build a hand recognition system for Lesotho’s Sesotho and English language interpretation. The system will help to bridge the communication problems encountered by these communities. The system has various processing modules: a hand detection engine, an image processing engine, feature extraction, and sign recognition. Detection is the process of identifying an object. The proposed system uses Canny pruning Haar and Haarcascade detection algorithms. Canny pruning implements Canny edge detection, an optimal image processing algorithm used to detect the edges of an object. The system also employs a skin detection algorithm, which performs background subtraction and computes the convex hull and the centroid to assist in the detection process. Recognition is the process of gesture classification. Template matching classifies each hand gesture in real time. The system was tested in various experiments. The results show that time, distance, and light are factors that affect the rate of detection and ultimately recognition. Detection rate is directly proportional to the distance of the hand from the camera. Different lighting conditions were considered: the higher the light intensity, the higher the detection rate.
Based on the results obtained from this research, the applied methodologies are efficient and provide a plausible solution towards a light-weight, inexpensive system which can be used for sign language interpretation.
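The convex hull and centroid computed by the skin detection step in the abstract above can be sketched in pure Python. The system itself most likely relies on library routines, so this is only an illustration of the geometry: it uses Andrew's monotone chain algorithm, takes the centroid as the mean of the hull vertices (an assumption; an area-weighted centroid is another option), and works on made-up contour points.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB (positive = left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Convex hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other; drop duplicates.
    return lower[:-1] + upper[:-1]


def centroid(points):
    """Mean position of a list of (x, y) points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


# Made-up contour points of a detected hand region, for illustration only.
contour = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2), (1, 3)]
hull = convex_hull(contour)
center = centroid(hull)
print(hull, center)
```

The hull outlines the hand region while the centroid gives a stable reference point for tracking it between frames.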
Intrusion detection systems (IDS) are main components of network security. These systems analyze network events to detect intrusions. An IDS is designed by training it on normal or attack traffic data, and machine learning methods are among the best ways to design IDSs. In the method presented in this article, the pruning algorithm of the C5.0 decision tree is used to reduce the features of the traffic data, and the IDS is trained with the least squares support vector machine (LS-SVM). The remaining features are then ranked according to the predictor importance criterion, and the least important features are eliminated in order. The features remaining at the stage that yields the highest LS-SVM accuracy are selected as the final features. Compared with other similar articles that have examined selected features in the least squares support vector machine model, the features obtained perform better in accuracy, true positive rate, and false positive rate. The results are tested on the UNSW-NB15 dataset.
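The elimination procedure described in the abstract above, dropping the least important feature at each stage and keeping the subset with the highest accuracy, can be sketched as follows. The feature names, the accuracy values and the `evaluate` callback are hypothetical stand-ins; in the paper the ranking comes from C5.0 predictor importance and the score from an LS-SVM.

```python
def select_features(ranked_features, evaluate):
    """Backward elimination over features ranked from most to least important.

    `evaluate` is a caller-supplied function mapping a feature list to an
    accuracy score. Ties are resolved in favour of the smaller subset.
    """
    current = list(ranked_features)
    best_subset, best_score = list(current), evaluate(current)
    while len(current) > 1:
        current = current[:-1]  # drop the least important remaining feature
        score = evaluate(current)
        if score >= best_score:
            best_subset, best_score = list(current), score
    return best_subset, best_score


# Hypothetical accuracies keyed by subset size, standing in for a trained model.
toy_accuracy = {4: 0.90, 3: 0.93, 2: 0.95, 1: 0.80}


def evaluate(features):
    return toy_accuracy[len(features)]


ranked = ["f1", "f2", "f3", "f4"]  # placeholder names, most important first
subset, score = select_features(ranked, evaluate)
print(subset, score)
```

The loop evaluates every stage rather than stopping at the first drop in accuracy, so a temporary dip does not hide a better, smaller subset further along.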