Two earliness cotton genotypes I and II, which had been developed by hybridization and backcross methods between sindise-80 as an early maturing gene parent and two other lines i.e. Red leaf and Bulgare-557 as a second parent, are subjected to different environmental conditions. The early maturing genotypes with coded names of I and II were compared with four native cotton cultivars in randomized complete block design (RCBD) with four replications in three ecological regions of Iran from 2016-2017. Two early maturing genotypes along with four native cultivars viz. Varamin, Oltan, Sahel and Arya were planted in Agricultural Research Station of Varamin, Moghan and Kashmar for evaluation. Earliness data were collected for six treatments during two years in the three regions except missing data for the second year of Kashmar. Therefore, missed data were estimated and imputed. For testing the homogeneity of error variances, each experiment at a given location or year is analyzed separately using Hartley and Bartlett’s Chi-square tests and both tests confirmed homogeneity of variance. Combined analysis of variance showed that genotypes I and II were superior in Varamin, Moghan and Kashmar regions. Earliness means and their interaction effects were compared with Duncan’s multiple range tests. Finally combined analysis of variance showed that genotypes I and II were superior in Varamin, Moghan and Kashmar regions. Earliness means and their interaction effects are compared with Duncan’s multiple range tests.
A major challenge in medical studies, especially those that are longitudinal, is the problem of missing measurements which hinders the effective application of many machine learning algorithms. Furthermore, recent Alzheimer's Disease studies have focused on the delineation of Early Mild Cognitive Impairment (EMCI) and Late Mild Cognitive Impairment (LMCI) from cognitively normal controls (CN) which is essential for developing effective and early treatment methods. To address the aforementioned challenges, this paper explores the potential of using the eXtreme Gradient Boosting (XGBoost) algorithm in handling missing values in multiclass classification. We seek a generalized classification scheme where all prodromal stages of the disease are considered simultaneously in the classification and decision-making processes. Given the large number of subjects (1631) included in this study and in the presence of almost 28% missing values, we investigated the performance of XGBoost on the classification of the four classes of AD, NC, EMCI, and LMCI. Using 10-fold cross validation technique, XGBoost is shown to outperform other state-of-the-art classification algorithms by 3% in terms of accuracy and F-score. Our model achieved an accuracy of 80.52%, a precision of 80.62% and recall of 80.51%, supporting the more natural and promising multiclass classification.
STRIM (Statistical Test Rule Induction Method) has been proposed as a method to effectively induct if-then rules from the decision table which is considered as a sample set obtained from the population of interest. Its usefulness has been confirmed by simulation experiments specifying rules in advance, and by comparison with conventional methods. However, scope for future development remains before STRIM can be applied to the analysis of real-world data sets. The first requirement is to determine the size of the dataset needed for inducting true rules, since finding statistically significant rules is the core of the method. The second is to examine the capacity of rule induction from datasets with contaminated attribute values created by missing data and noise, since real-world datasets usually contain such contaminated data. This paper examines the first problem theoretically, in connection with the rule length. The second problem is then examined in a simulation experiment, utilizing the critical size of dataset derived from the first step. The experimental results show that STRIM is highly robust in the analysis of datasets with contaminated attribute values, and hence is applicable to real-world data
Safety is one of the most important considerations when buying a new car. While active safety aims at avoiding accidents, passive safety systems such as airbags and seat belts protect the occupant in case of an accident. In addition to legal regulations, organizations like Euro NCAP provide consumers with an independent assessment of the safety performance of cars and drive the development of safety systems in automobile industry. Those ratings are mainly based on injury assessment reference values derived from physical parameters measured in dummies during a car crash test. The components and sub-systems of a safety system are designed to achieve the required restraint performance. Sled tests and other types of tests are then carried out by car makers and their suppliers to confirm the protection level of the safety system. A Knowledge Discovery in Databases (KDD) process is proposed in order to minimize the number of tests. The KDD process is based on the data emerging from sled tests according to Euro NCAP specifications. About 30 parameters of the passive safety systems from different data sources (crash data, dummy protocol) are first analysed together with experts opinions. A procedure is proposed to manage missing data and validated on real data sets. Finally, a procedure is developed to estimate a set of rough initial parameters of the passive system before testing aiming at reducing the number of tests.
Analyzing DNA microarray data sets is a great challenge, which faces the bioinformaticians due to the complication of using statistical and machine learning techniques. The challenge will be doubled if the microarray data sets contain missing data, which happens regularly because these techniques cannot deal with missing data. One of the most important data analysis process on the microarray data set is feature selection. This process finds the most important genes that affect certain disease. In this paper, we introduce a technique for imputing the missing data in microarray data sets while performing feature selection.
Via a large scale cross-sectional study among Japanese white color workers, the authors aimed to elucidate: (1) the distributions of Sense of Coherence (SOC), which reflect stress coping abilities, (2) the distributions of Life experience; (3) and the association between SOC and Life experience. Anonymous self-administered questionnaires were sent to 15,891 in 2001 and 21,922 in 2011 employees at educational and research institutions in Tsukuba Research Park City. A total of 5,868 (36.9%) and 9,528 (43.5%) respectively workers completed and returned the questionnaire; 5,715 and 9,515 respectively workers without missing data were analyzed. SOC scale scores differed by gender, age, and other demographic features in both study years. Among the life experiences, workers who have got over parenting or management position were higher SOC scale scores adjusted by gender and age. The life experiences that workers have got over could develop their stronger SOC in their life course.
De novo genome assembly is always fragmented. Assembly fragmentation is more serious using the popular next generation sequencing (NGS) data because NGS sequences are shorter than the traditional Sanger sequences. As the data throughput of NGS is high, the fragmentations in assemblies are usually not the result of missing data. On the contrary, the assembled sequences, called contigs, are often connected to more than one other contigs in a complicated manner, leading to the fragmentations. False connections in such complicated connections between contigs, named a contig graph, are inevitable because of repeats and sequencing/assembly errors. Simplifying a contig graph by removing false connections directly improves genome assembly. In this work, we have developed a tool, SIMGraph, to resolve ambiguous connections between contigs using NGS data. Applying SIMGraph to the assembly of a fungus and a fish genome, we resolved 27.6% and 60.3% ambiguous contig connections, respectively. These results can reduce the experimental efforts in resolving contig connections.
There are many situations where input feature vectors are incomplete and methods to tackle the problem have been studied for a long time. A commonly used procedure is to replace each missing value with an imputation. This paper presents a method to perform categorical missing data imputation from numerical and categorical variables. The imputations are based on Simpson-s fuzzy min-max neural networks where the input variables for learning and classification are just numerical. The proposed method extends the input to categorical variables by introducing new fuzzy sets, a new operation and a new architecture. The procedure is tested and compared with others using opinion poll data.
Ratio and regression type estimators have been used by previous authors to estimate a population mean for the principal variable from samples in which both auxiliary x and principal y variable data are available. However, missing data are a common problem in statistical analyses with real data. Ratio and regression type estimators have also been used for imputing values of missing y data. In this paper, six new ratio and regression type estimators are proposed for imputing values for any missing y data and estimating a population mean for y from samples with missing x and/or y data. A simulation study has been conducted to compare the six ratio and regression type estimators with a previous estimator of Rueda. Two population sizes N = 1,000 and 5,000 have been considered with sample sizes of 10% and 30% and with correlation coefficients between population variables X and Y of 0.5 and 0.8. In the simulations, 10 and 40 percent of sample y values and 10 and 40 percent of sample x values were randomly designated as missing. The new ratio and regression type estimators give similar mean absolute percentage errors that are smaller than the Rueda estimator for all cases. The new estimators give a large reduction in errors for the case of 40% missing y values and sampling fraction of 30%.
Missing data yields many analysis challenges. In case of complex survey design, in addition to dealing with missing data, researchers need to account for the sampling design to achieve useful inferences. Methods for incorporating sampling weights in neural network imputation were investigated to account for complex survey designs. An estimate of variance to account for the imputation uncertainty as well as the sampling design using neural networks will be provided. A simulation study was conducted to compare estimation results based on complete case analysis, multiple imputation using a Markov Chain Monte Carlo, and neural network imputation. Furthermore, a public-use dataset was used as an example to illustrate neural networks imputation under a complex survey design