International Science Index
Modeling Engagement with Multimodal Multisensor Data: The Continuous Performance Test as an Objective Tool to Track Flow
Engagement is one of the most important factors in determining successful outcomes and deep learning in students. Existing approaches to detect student engagement involve periodic human observations that are subject to inter-rater reliability issues. Our solution uses real-time multimodal multisensor data labeled by objective performance outcomes to infer the engagement of students. The study involves four students with a combined diagnosis of cerebral palsy and a learning disability who took part in a 3-month trial over 59 sessions. Multimodal multisensor data were collected while they participated in a continuous performance test. Eye gaze, electroencephalogram, body pose, and interaction data were used to create a model of student engagement through objective labeling from the continuous performance test outcomes. To achieve this, a type of continuous performance test is introduced, the Seek-X type. Nine features were extracted, including high-level handpicked compound features. Using leave-one-out cross-validation, a series of different machine learning approaches were evaluated. Overall, the random forest classification approach achieved the best classification results. Using random forest, 93.3% classification accuracy for engagement and 42.9% accuracy for disengagement were achieved. We compared these results to outcomes from different models: AdaBoost, decision tree, k-Nearest Neighbor, naïve Bayes, neural network, and support vector machine. We showed that using a multisensor approach achieved higher accuracy than using features from any reduced set of sensors. We found that using high-level handpicked features can improve the classification accuracy in every sensor mode. Our approach is robust to both sensor fallout and occlusions. The single most important sensor feature to the classification of engagement and distraction was shown to be eye gaze.
It has been shown that we can accurately predict the level of engagement of students with learning disabilities with a real-time approach that is not subject to inter-rater reliability issues, does not depend on human observation, and is not reliant on a single mode of sensor input. This will help teachers design interventions for a heterogeneous group of students, where teachers cannot possibly attend to each of their individual needs. Our approach can be used to identify those with the greatest learning challenges so that all students are supported to reach their full potential.
Affective computing in education, affect detection, continuous performance test, learning disabilities, machine learning, physiological sensors, Signal Detection Theory, student engagement.
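The abstract above reports separate accuracies for the engaged and disengaged classes. As a rough illustration of how such per-class rates can be computed from held-out predictions, here is a minimal sketch; the function name `per_class_accuracy` and the toy labels are assumptions for illustration and are not data from the study.

```python
def per_class_accuracy(y_true, y_pred):
    """Fraction of samples of each true class that were predicted correctly."""
    totals, correct = {}, {}
    for truth, guess in zip(y_true, y_pred):
        totals[truth] = totals.get(truth, 0) + 1
        correct[truth] = correct.get(truth, 0) + (truth == guess)
    return {label: correct[label] / totals[label] for label in totals}


# Toy labels only (hypothetical), e.g. collected one at a time from
# leave-one-out predictions.
y_true = ["engaged", "engaged", "engaged", "disengaged", "disengaged"]
y_pred = ["engaged", "engaged", "disengaged", "disengaged", "engaged"]
rates = per_class_accuracy(y_true, y_pred)
```

Reporting the two rates separately matters when classes are imbalanced, because a single overall accuracy can hide weak performance on the rarer class.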
In situ Real-Time Multivariate Analysis of Methanolysis Monitoring of Sunflower Oil Using FTIR
The combination of world population growth and the third industrial revolution has led to high demand for fuels. At the same time, the decline of global fossil fuel deposits and the air pollution caused by these fuels have compounded the challenges the world faces in meeting its energy needs. Therefore, new forms of environmentally friendly and renewable fuels such as biodiesel are needed. The primary analytical techniques for methanolysis yield monitoring have been chromatography and spectroscopy; these methods have proven reliable but are more demanding and costly, and they do not provide real-time monitoring. In this work, the in situ monitoring of biodiesel production from sunflower oil using FTIR (Fourier Transform Infrared) spectroscopy has been studied; the study was performed using an EasyMax Mettler Toledo reactor equipped with a DiComp (Diamond) probe. The quantitative monitoring of methanolysis was performed by building a quantitative model with multivariate calibration using the iC Quant module from iC IR 7.0 software. Fifteen samples of known concentrations, taken in duplicate, were used for model calibration and cross-validation. Data were pre-processed using mean centering and variance scaling, spectrum math square root and solvent subtraction. These pre-processing methods improved the performance indexes RMSEC, RMSECV, RMSEP and R2Cum from 7.98 to 0.0096, 11.2 to 3.41, 6.32 to 2.72, and 0.9416 to 0.9999, respectively. The R2 values of 1 (training), 0.9918 (test) and 0.9946 (cross-validation) indicated the fitness of the model built. The model was tested against a univariate model; small discrepancies were observed at low concentrations due to unmodelled intermediates, but the two models agreed closely at concentrations above 18%. The software eliminated the complexity of the Partial Least Squares (PLS) chemometrics. It was concluded that the model obtained could be used to monitor methanolysis of sunflower oil at industrial and lab scale.
Optimizing the Probabilistic Neural Network Training Algorithm for Multi-Class Identification
In this work, a training algorithm for probabilistic neural networks (PNN) is presented. The algorithm addresses one of the major drawbacks of PNN, which is the size of the hidden layer in the network. By using a cross-validation training algorithm, the number of hidden neurons is shrunk to a smaller number consisting of the most representative samples of the training set. This is done without affecting the overall architecture of the network. Performance of the network is compared against performance of standard PNN for different databases from the UCI database repository. Results show an important gain in network size and performance.
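To make the hidden-layer size issue concrete, the following is a minimal sketch of a standard PNN decision rule, in which every training sample acts as one hidden (pattern) neuron. It is not the pruning algorithm proposed in the abstract; the Gaussian kernel width `sigma` and the toy data are assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def pnn_predict(train_X, train_y, x, sigma=0.5):
    """Classify x with a basic PNN: one Gaussian pattern unit per training sample."""
    activation_sum = defaultdict(float)
    class_count = defaultdict(int)
    for sample, label in zip(train_X, train_y):
        sq_dist = sum((a - b) ** 2 for a, b in zip(sample, x))
        # Pattern layer: Gaussian kernel centred on the stored sample.
        activation_sum[label] += math.exp(-sq_dist / (2.0 * sigma ** 2))
        class_count[label] += 1
    # Summation layer averages activations per class; output layer takes the arg-max.
    return max(activation_sum, key=lambda c: activation_sum[c] / class_count[c])


# Toy data (hypothetical): two well-separated classes in 2-D.
train_X = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
train_y = ["a", "a", "b", "b"]
```

Because the loop visits every stored sample at prediction time, cost and memory grow linearly with the training set, which is the drawback that a reduced set of representative samples is meant to address.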
Specific Emitter Identification Based on Refined Composite Multiscale Dispersion Entropy
Wireless communication networks are developing rapidly, so
wireless security is becoming increasingly important.
Specific emitter identification (SEI) is a vital part of wireless
communication security as a technique to identify unique
transmitters. In this paper, an SEI method based on multiscale
dispersion entropy (MDE) and refined composite multiscale dispersion
entropy (RCMDE) is proposed. The algorithms of MDE and RCMDE
are used to extract features for identification of five wireless
devices and cross-validation support vector machine (CV-SVM)
is used as the classifier. The experimental results show that the
total identification accuracy is 99.3%, even at a low signal-to-noise
ratio (SNR) of 5 dB, which indicates that MDE and RCMDE can
describe the communication signal series well. In addition, compared
with other methods, the proposed method is effective and provides
better accuracy and stability for SEI.
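For readers unfamiliar with the entropy features named above, the following is a rough sketch of dispersion entropy with refined composite coarse-graining, written from the commonly published definition rather than from this paper. The embedding dimension `m`, the number of classes `c`, and the test signal are assumptions, and the input must not be constant (a zero standard deviation would break the normal CDF mapping).

```python
import math
from collections import Counter
from statistics import NormalDist, mean, pstdev


def dispersion_probs(x, m, c, mu, sd):
    """Relative frequencies of dispersion patterns of length m with c classes."""
    cdf = NormalDist(mu, sd).cdf
    # Map each value to a class in 1..c through the normal CDF.
    classes = [min(c, max(1, round(c * cdf(v) + 0.5))) for v in x]
    patterns = Counter(tuple(classes[i:i + m]) for i in range(len(classes) - m + 1))
    total = sum(patterns.values())
    return {p: n / total for p, n in patterns.items()}


def coarse_grain(x, scale, offset=0):
    """Average non-overlapping windows of length `scale`, starting at `offset`."""
    n = (len(x) - offset) // scale
    return [mean(x[offset + j * scale: offset + (j + 1) * scale]) for j in range(n)]


def rcmde(x, scale, m=2, c=3):
    """Refined composite multiscale dispersion entropy at one scale (in nats)."""
    mu, sd = mean(x), pstdev(x)  # taken from the original signal for all scales
    averaged = Counter()
    for offset in range(scale):
        probs = dispersion_probs(coarse_grain(x, scale, offset), m, c, mu, sd)
        for pattern, p in probs.items():
            averaged[pattern] += p / scale
    return -sum(p * math.log(p) for p in averaged.values() if p > 0)


signal = [math.sin(0.3 * i) for i in range(200)]  # hypothetical test signal
feature = rcmde(signal, scale=2)
```

With `scale=1` the function reduces to plain dispersion entropy; evaluating it over several scales gives a feature vector of the kind that could be passed to a classifier.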
Multi-Level Air Quality Classification in China Using Information Gain and Support Vector Machine
Machine Learning and Data Mining are two important tools for extracting useful information and knowledge from large datasets. In machine learning, classification is a widely used technique to predict qualitative variables and is generally preferred over regression from an operational point of view. Due to the enormous increase in air pollution in various countries, especially China, Air Quality Classification has become one of the most important topics in air quality research and modelling. This study aims at introducing a hybrid classification model based on information theory and Support Vector Machine (SVM) using the air quality data of four cities in China, namely Beijing, Guangzhou, Shanghai and Tianjin, from Jan 1, 2014 to April 30, 2016. China's Ministry of Environmental Protection has classified the daily air quality into 6 levels, namely Serious Pollution, Severe Pollution, Moderate Pollution, Light Pollution, Good and Excellent, based on their respective Air Quality Index (AQI) values. Using information theory, information gain (IG) is calculated and feature selection is done for both categorical features and continuous numeric features. The SVM algorithm is then applied to the selected features with cross-validation. The final evaluation reveals that the IG and SVM hybrid model performs better than SVM (alone), Artificial Neural Network (ANN) and K-Nearest Neighbours (KNN) models in terms of accuracy as well as complexity.
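The information gain used for feature selection above can be illustrated with a short sketch for a categorical feature. The function names and the toy values are assumptions; a continuous feature would first need to be discretised (for example by thresholding), which is not shown here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on a categorical feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy example (hypothetical values): one informative and one uninformative feature.
levels = ["Good", "Good", "Light Pollution", "Light Pollution"]
informative = ["low", "low", "high", "high"]
uninformative = ["x", "y", "x", "y"]
```

Features would then be ranked by their gain, and only the top-ranked ones passed on to the classifier.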
PM10 Prediction and Forecasting Using CART: A Case Study for Pleven, Bulgaria
Ambient air pollution with fine particulate matter (PM10) is a persistent, systematic problem in many countries around the world. The accumulation of a large number of measurements of both PM10 concentrations and the accompanying atmospheric factors allows for their statistical modeling to detect dependencies and forecast future pollution. This study applies the classification and regression trees (CART) method for building and analyzing PM10 models. In the empirical study, average daily air data for the city of Pleven, Bulgaria for a period of 5 years are used. Predictors in the models are seven meteorological variables, time variables, as well as lagged PM10 variables and some lagged meteorological variables, delayed by 1 or 2 days with respect to the initial time series. The degree of influence of the predictors in the models is determined. The selected best CART models are used to forecast PM10 concentrations two days ahead of the last date in the modeling procedure and show very accurate results.
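The lagged predictors mentioned above (values delayed by 1 or 2 days) can be built as in the following minimal sketch. The helper name `add_lags` and the toy series are assumptions for illustration.

```python
def add_lags(series, lags=(1, 2)):
    """Return rows (value_t, value_{t-lag1}, value_{t-lag2}, ...) for each day t
    that has a complete history for every requested lag."""
    max_lag = max(lags)
    rows = []
    for t in range(max_lag, len(series)):
        rows.append((series[t],) + tuple(series[t - k] for k in lags))
    return rows


# Toy daily series (hypothetical values, not measurements from the study).
pm10 = [10, 20, 30, 40]
rows = add_lags(pm10)
```

The first `max(lags)` days are dropped because they have no complete history; the remaining rows can be used as target and predictor columns for a regression tree.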
Model-Driven and Data-Driven Approaches for Crop Yield Prediction: Analysis and Comparison
Crop yield prediction is a paramount issue in
agriculture. The main idea of this paper is to find an efficient
way to predict the yield of corn based on meteorological records.
The prediction models used in this paper can be classified into
model-driven approaches and data-driven approaches, according to
the different modeling methodologies. The model-driven approaches are based on crop mechanistic
modeling. They describe crop growth in interaction with their
environment as dynamical systems. However, calibrating
the dynamic system is difficult, because it
turns out to be a multidimensional non-convex optimization problem.
An original contribution of this paper is to propose a statistical
methodology, Multi-Scenarios Parameters Estimation (MSPE), for the
parametrization of potentially complex mechanistic models from a
new type of datasets (climatic data, final yield in many situations).
It is tested with CORNFLO, a crop model for maize growth. On the other hand, the data-driven approach for yield prediction
is free of the complex biophysical process. But it has some strict
requirements about the dataset.
A second contribution of the paper is the comparison of these
model-driven methods with classical data-driven methods. For this
purpose, we consider two classes of regression methods, methods
derived from linear regression (Ridge and Lasso Regression, Principal
Components Regression or Partial Least Squares Regression) and
machine learning methods (Random Forest, k-Nearest Neighbor,
Artificial Neural Network and SVM regression).
The dataset consists of 720 records of corn yield at county scale
provided by the United States Department of Agriculture (USDA) and
the associated climatic data. A 5-fold cross-validation process and
two accuracy metrics, root mean square error of prediction (RMSEP) and
mean absolute error of prediction (MAEP), were used to evaluate the
crop prediction capacity.
The results show that among the data-driven approaches, Random
Forest is the most robust and generally achieves the best prediction
error (MAEP 4.27%). It also outperforms our model-driven approach
(MAEP 6.11%). However, the method to calibrate the mechanistic
model from easily accessible datasets offers several side-perspectives.
The mechanistic model can potentially help to underline the stresses
suffered by the crop or to identify the biological parameters of interest
for breeding purposes. For this reason, an interesting perspective is
to combine these two types of approaches.
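The evaluation protocol described in this abstract (k-fold cross-validation scored with RMSEP and MAEP) can be sketched as follows. A one-variable least-squares line stands in for the regression methods compared in the paper, the folds are contiguous rather than shuffled, and the errors are returned in absolute units rather than the percentages quoted above; all of these are simplifying assumptions.

```python
import math


def k_fold_indices(n, k=5):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx


def cross_validate(xs, ys, k=5):
    """Return (RMSEP, MAEP) over all held-out predictions."""
    errors = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        slope, intercept = fit_line(train_x, train_y)
        errors.extend(ys[i] - (slope * xs[i] + intercept) for i in test_idx)
    rmsep = math.sqrt(sum(e * e for e in errors) / len(errors))
    maep = sum(abs(e) for e in errors) / len(errors)
    return rmsep, maep


# Toy data (hypothetical): an exactly linear relation, so both errors are ~0.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
rmsep, maep = cross_validate(xs, ys)
```

In practice the records would usually be shuffled before splitting, and the same folds reused for every method so that the error figures are comparable.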
A Psychophysiological Evaluation of an Effective Recognition Technique Using Interactive Dynamic Virtual Environments
Recording psychological and physiological correlates of human performance within virtual environments and interpreting their impacts on human engagement, ‘immersion’ and related emotional or ‘affective’ states is both academically and technologically challenging. By exposing participants to an affective, real-time (game-like) virtual environment, designed and evaluated in an earlier study, a psychophysiological database containing the EEG, GSR and heart rate of 30 male and female gamers, exposed to 10 games, was constructed. Some 174 features were subsequently identified and extracted from a number of windows, with 28 different timing lengths (e.g. 2, 3, 5, etc. seconds). After reducing the number of features to 30 using a feature selection technique, K-Nearest Neighbour (KNN) and Support Vector Machine (SVM) methods were employed for the classification process. The classifiers categorised the psychophysiological database into four affective clusters (defined in a 3-dimensional space of valence, arousal and dominance) and eight emotion labels (relaxed, content, happy, excited, angry, afraid, sad, and bored). The KNN and SVM classifiers achieved average cross-validation accuracies of 97.01% (±1.3%) and 92.84% (±3.67%), respectively. However, no significant differences were found in the classification process based on affective clusters or emotion labels.
A Linear Regression Model for Estimating Anxiety Index Using Wide Area Frontal Lobe Brain Blood Volume
Major depressive disorder (MDD) is one of the most common mental illnesses today. It is believed to be caused by a combination of several factors, including stress. Stress can be quantitatively evaluated using the State-Trait Anxiety Inventory (STAI), one of the best indices to evaluate anxiety. Although STAI scores are widely used in applications ranging from clinical diagnosis to basic research, the scores are calculated based on a self-reported questionnaire. An objective evaluation is required because the subject may intentionally change his/her answers if multiple tests are carried out. In this article, we present a modified index called the “multi-channel Laterality Index at Rest (mc-LIR)” by recording the brain activity from a wider area of the frontal lobe using multi-channel functional near-infrared spectroscopy (fNIRS). The presented index aims to measure multiple positions near the Fpz defined by the international 10-20 system positioning. Using 24 subjects, the dependencies on the number of measuring points used to calculate the mc-LIR and its correlation coefficients with the STAI scores are reported. Furthermore, a simple linear regression was performed to estimate the STAI scores from mc-LIR. The cross-validation error is also reported. The experimental results show that using multiple positions near the Fpz improves the correlation coefficients and the estimation compared with using only two positions.
Radiochemical Purity of 68Ga-BCA-Peptides: Separation of All 68Ga Species with a Single iTLC Strip
In the present study, a highly effective single-strip iTLC method for the determination of radiochemical purity (RCP) of 68Ga-BCA-peptides was developed (with no double-developing, changing of eluents or other additional manipulation). In this method, iTLC-SG strips and the commonly used eluent TFAaq. (3-5 % (v/v)) are used. The method allows each of the key radiochemical forms of 68Ga (colloidal, bound, ionic) to be determined separately, with peak separation of no less than 4 σ: Rf = 0.0-0.1 for 68Ga-colloid; Rf = 0.5-0.6 for 68Ga-BCA-peptides; Rf = 0.9-1.0 for ionic 68Ga. The method is simple and fast: for a developing length of 75 mm only 4-6 min is required (versus 18-20 min for the pharmacopoeial method). The method has been tested on various compounds (including 68Ga-DOTA-TOC, 68Ga-DOTA-TATE, 68Ga-NODAGA-RGD2 etc.). The cross-validation work for every specific form of 68Ga showed good correlation between the developed method and the control (pharmacopoeial) methods. The method can become a convenient and much more informative replacement for pharmacopoeial methods, including HPLC.
Efficient Tuning Parameter Selection by Cross-Validated Score in High Dimensional Models
As DNA microarray data have a relatively small
sample size compared to the number of genes, high dimensional
models are often employed. In high dimensional models, the selection
of the tuning parameter (or penalty parameter) is often one of the crucial
parts of the modeling. Cross-validation is one of the most common
methods for the tuning parameter selection, which selects a parameter
value with the smallest cross-validated score. However, selecting a
single value as an ‘optimal’ value for the parameter can be very
unstable due to the sampling variation since the sample sizes of
microarray data are often small. Our approach is to choose multiple candidates for the tuning parameter
first, then average the candidates with different weights depending
on their performance. The additional step of estimating the weights
and averaging the candidates barely increases the computational cost,
while it can considerably improve on traditional cross-validation. We
show, via real and simulated data sets, that the value selected by the suggested methods often leads to
stable parameter selection as well as improved detection of significant
genetic variables compared to traditional cross-validation.
Classifying Students for E-Learning in Information Technology Course Using ANN
This research’s objective is to select the most accurate
model, using a neural network technique, as a way to
screen potential students who enroll in an IT course by electronic learning
at Suan Sunandha Rajabhat University. It is designed to help students
select the appropriate courses by themselves. The results showed
that the most accurate model used 100-fold cross-validation, which
achieved 73.58% accuracy.
Vehicle Type Classification with Geometric and Appearance Attributes
With the increase in population along with economic prosperity, an enormous increase in the number and types of vehicles on the roads has occurred. This brings a growing need for efficiently yet effectively classifying vehicles into their corresponding categories, which plays a crucial role in many areas of infrastructure planning and traffic management.
This paper presents two vehicle-type classification approaches: 1) geometric-based and 2) appearance-based. The two classification approaches are used for two tasks: multi-class and intra-class vehicle classification. To evaluate the performance of the proposed classification approaches and identify the most effective yet efficient one, the 10-fold cross-validation technique is used with a large dataset. The proposed approaches are distinguishable from previous research on vehicle classification in that: i) they consider both geometric and appearance attributes of vehicles, and ii) they perform remarkably well in both multi-class and intra-class vehicle classification. Experimental results show promising potential for implementing the proposed vehicle classification approaches in real-world applications.
QSAR Studies of Certain Novel Heterocycles Derived from Bis-1, 2, 4 Triazoles as Anti-Tumor Agents
In this paper we report the quantitative structure-activity relationship (QSAR) of novel bis-triazole derivatives for predicting the activity profile. The full model encompassed a dataset of 46 bis-triazoles. The Tripos Sybyl X 2.0 program was used to conduct CoMSIA QSAR modeling. The Partial Least-Squares (PLS) analysis method was used to conduct statistical analysis and to derive a QSAR model based on the field values of the CoMSIA descriptors. The compounds were divided into test and training sets. The compounds were evaluated by various CoMSIA parameters to predict the best QSAR model. An optimum number of components was first determined separately by cross-validation regression for the CoMSIA model and then applied in the final analysis. A series of parameters were used for the study, and the best-fit model was obtained using donor, partition coefficient and steric parameters. The CoMSIA models demonstrated good statistical results with regression coefficient (r2) and cross-validated coefficient (q2) of 0.575 and 0.830 respectively. The standard error for the predicted model was 0.16322. In the CoMSIA model, the steric descriptors make a marginally larger contribution than the electrostatic descriptors. The finding that the steric descriptor is the largest contributor for the CoMSIA QSAR models is consistent with the observation that more than half of the binding site area is occupied by steric regions.
Stature Estimation Based On Lower Limb Dimensions in the Malaysian Population
Estimation of stature is an important step in developing a biological profile for human identification. It may provide a valuable indicator for an unknown individual in a population. The aim of this study was to analyse the relationship between stature and lower limb dimensions in the Malaysian population. The sample comprised 100 corpses, which included 69 males and 31 females aged between 20 and 90 years. The parameters measured were stature, thigh length, lower leg length, leg length, foot length, foot height and foot breadth. Results showed that mean values in males were significantly higher than those in females (P < 0.05). There were significant correlations between lower limb dimensions and stature. Cross-validation of the equation on 100 individuals showed close approximation between known stature and estimated stature. It was concluded that lower limb dimensions were useful for estimation of stature, which should be validated in future studies.
Autonomously Determining the Parameters for SVDD with RBF Kernel from a One-Class Training Set
The one-class support vector machine “support vector
data description” (SVDD) is an ideal approach for anomaly or outlier
detection. However, for the applicability of SVDD in real-world
applications, the ease of use is crucial. The results of SVDD are
strongly determined by the choice of the regularisation parameter C
and the kernel parameter of the widely used RBF kernel. While for
two-class SVMs the parameters can be tuned using cross-validation
based on the confusion matrix, for a one-class SVM this is not
possible, because only true positives and false negatives can occur
during training. This paper proposes an approach to find the optimal
set of parameters for SVDD solely based on a training set from
one class and without any user parameterisation. Results on artificial
and real data sets are presented, underpinning the usefulness of the
Comparative Study of Filter Characteristics as Statistical Vocal Correlates of Clinical Psychiatric State in Human
Acoustical properties of speech have been shown to
be related to the mental states of speakers with symptoms of depression
and remission. This paper describes a way to address the issue of
distinguishing depressed patients from remitted subjects based on
measurable acoustic changes in their speech. The vocal-tract
related frequency characteristics of speech samples from female
remitted and depressed patients were analyzed via speech
processing techniques and consequently, evaluated statistically by
cross-validation with Support Vector Machine. Our comparative
results show classifier performance with
correct separation of 93% when tested with the subject-based
feature model and 88% with the frame-based model, based on
the same speech samples collected from hospital-visit interview
sessions between patients and psychiatrists.
Geostatistical Analysis and Mapping of Ground-Level Ozone in a Medium Sized Urban Area
Ground-level tropospheric ozone is one of the air
pollutants of most concern. It is mainly produced by photochemical
processes involving nitrogen oxides and volatile organic compounds
in the lower parts of the atmosphere. Ozone levels become
particularly high in regions close to high ozone precursor emissions
and during summer, when stagnant meteorological conditions with
high insolation and high temperatures are common.
In this work, some results of a study about urban ozone
distribution patterns in the city of Badajoz, which is the largest and
most industrialized city in Extremadura region (southwest Spain) are
shown. Fourteen sampling campaigns, at least one per month, were
carried out to measure ambient air ozone concentrations, during
periods that were selected according to favourable conditions to
ozone production, using an automatic portable analyzer.
Later, to evaluate the ozone distribution in the city, the measured
ozone data were analyzed using geostatistical techniques. Thus, first,
during the exploratory analysis of data, it was revealed that they were
distributed normally, which is a desirable property for the subsequent
stages of the geostatistical study. Secondly, during the structural
analysis of data, theoretical spherical models provided the best fit for
all monthly experimental variograms. The parameters of these
variograms (sill, range and nugget) revealed that the maximum
distance of spatial dependence is between 302 and 790 m and that the
variable, air ozone concentration, is not evenly distributed over short
distances. Finally, predictive ozone maps were derived for all points
of the experimental study area, by use of geostatistical algorithms
(kriging). High prediction accuracy was obtained in all cases as
cross-validation showed. Useful information for hazard assessment
was also provided when probability maps, based on kriging
interpolation and kriging standard deviation, were produced.
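The spherical variogram model mentioned in this abstract, with its sill, range and nugget parameters, has a standard closed form that can be sketched as below. The parameter values in the example are made up for illustration, and `sill` is taken here to mean the total sill (nugget included), which is an assumption about convention.

```python
def spherical_variogram(h, nugget, sill, range_):
    """Spherical semivariogram value at lag distance h.

    `sill` is the total sill (plateau value, nugget included) and `range_`
    is the distance beyond which samples are no longer spatially correlated.
    """
    if h <= 0:
        return 0.0
    if h >= range_:
        return sill
    r = h / range_
    return nugget + (sill - nugget) * (1.5 * r - 0.5 * r ** 3)


# Hypothetical parameters: nugget 1.0, total sill 5.0, range 500 m.
halfway = spherical_variogram(250.0, nugget=1.0, sill=5.0, range_=500.0)
```

A fitted range of a few hundred metres, as reported above, means that the curve flattens at that distance and more distant measurements contribute little to a kriging estimate.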
Virulent-GO: Prediction of Virulent Proteins in Bacterial Pathogens Utilizing Gene Ontology Terms
Prediction of bacterial virulent protein sequences can
assist the identification and characterization of novel
virulence-associated factors and the discovery of drug/vaccine targets against
proteins indispensable to pathogenicity. Gene Ontology (GO)
annotation, which describes functions of genes and gene products as a
controlled vocabulary of terms, has been shown to be effective for a
variety of tasks such as gene expression study, GO annotation
prediction, protein subcellular localization, etc. In this study, we
propose a sequence-based method Virulent-GO by mining informative
GO terms as features for predicting bacterial virulent proteins.
Each protein in the datasets used by the existing method
VirulentPred is annotated by using BLAST to obtain its homologies
with known accession numbers for retrieving GO terms. After
investigating various popular classifiers using the same five-fold
cross-validation scheme, Virulent-GO using the single kind of GO
term features with an accuracy of 82.5% is slightly better than
VirulentPred with 81.8% using five kinds of sequence-based features.
In the independent test evaluation, Virulent-GO also yields better
results (82.0%) than VirulentPred (80.7%). When evaluating a single
kind of feature with SVM, the GO term feature performs much better
than each of the five kinds of sequence-based features.
Detailed Mapping of Pyroclastic Flow Deposits by SAR Data Processing for an Active Volcano in the Torrid Zone
Field mapping of an active volcano, especially in
the Torrid Zone, is usually hampered by several problems such as steep
terrain and bad atmospheric conditions. In this paper we present a
simple solution to this problem that combines Synthetic Aperture
Radar (SAR) and geostatistical methods. With this combination, we
could reduce the speckle effect in the SAR data and then estimate
the roughness distribution of the pyroclastic flow deposits. The main
purpose of this study is to accurately detect the spatial distribution of new pyroclastic
flow deposits, termed the P-zone, using the β° data from two
RADARSAT-1 SAR level-0 data sets. A single scene of Hyperion data and
field observations were used for cross-validation of the SAR results.
Mt. Merapi in central Java, Indonesia, was chosen as a study site and
the eruptions in May-June 2006 were examined. The P-zones were
found in the western and southern flanks. The area size and the longest
flow distance were calculated as 2.3 km² and 6.8 km, respectively. The
grain size variation of the P-zone was mapped in detail from fine to
coarse deposits regarding the C-band wavelength of 5.6 cm.
Performance Optimization of Data Mining Application Using Radial Basis Function Classifier
Text data mining is a process of exploratory data
analysis. Classification maps data into predefined groups or classes.
It is often referred to as supervised learning because the classes are
determined before examining the data. This paper describes a proposed
radial basis function classifier that performs comparative cross-validation
against an existing radial basis function classifier. The feasibility
and the benefits of the proposed approach are demonstrated by means
of a data mining problem: direct marketing. Direct marketing has
become an important application field of data mining. Comparative
Cross-validation involves estimation of accuracy by either stratified
k-fold cross-validation or equivalent repeated random subsampling.
The proposed method may have high bias, and its performance
(accuracy estimation in our case) may be poor due to high variance.
Thus the accuracy with the proposed radial basis function classifier was
lower than with the existing radial basis function classifier. However,
there is a smaller improvement in runtime and a larger improvement
in precision and recall. In the proposed method, classification
accuracy and prediction accuracy are determined where the
prediction accuracy is comparatively high.
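The stratified k-fold cross-validation mentioned in this abstract keeps the class proportions roughly equal in every fold. A minimal sketch of one way to build such folds follows; the round-robin assignment and the toy labels are assumptions for illustration, not the procedure used in the paper.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign sample indices to k folds so each fold mirrors the class balance."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal the members of each class to the folds in round-robin order.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Toy direct-marketing style labels (hypothetical): 4 responders, 8 non-responders.
labels = ["yes"] * 4 + ["no"] * 8
folds = stratified_folds(labels, k=4)
```

Each fold then serves once as the test set while the remaining folds form the training set, so a rare class such as responders is represented in every test split.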
Ensembling Adaptively Constructed Polynomial Regression Models
The approach of subset selection in polynomial
regression model building assumes that the chosen fixed full set of
predefined basis functions contains a subset that is sufficient to
describe the target relation sufficiently well. However, in most cases
the necessary set of basis functions is not known and needs to be
guessed – a potentially non-trivial (and long) trial and error process.
In our research we consider a potentially more efficient approach –
Adaptive Basis Function Construction (ABFC). It lets the model
building method itself construct the basis functions necessary for
creating a model of arbitrary complexity with adequate predictive
performance. However, there are two issues that to some extent
plague the methods of both the subset selection and the ABFC,
especially when working with relatively small data samples: the
selection bias and the selection instability. We try to correct these
issues by model post-evaluation using Cross-Validation and model
ensembling. To evaluate the proposed method, we empirically
compare it to ABFC methods without ensembling, to a widely used
method of subset selection, as well as to some other well-known
regression modeling methods, using publicly available data sets.
An Evaluation of Algorithms for Single-Echo Biosonar Target Classification
A recent neurospiking coding scheme for feature extraction from biosonar echoes of various plants is examined with a variety of stochastic classifiers. The derived feature vectors are employed in well-known stochastic classifiers, including nearest-neighborhood, single Gaussian and a Gaussian mixture with EM optimization. Classifiers' performances are evaluated by using cross-validation and bootstrapping techniques. It is shown that the various classifiers perform equivalently and that the modified preprocessing configuration yields considerably improved results.