International Science Index
Weighted-Distance Sliding Windows and Cooccurrence Graphs for Supporting Entity-Relationship Discovery in Unstructured Text
The problem of Entity relation discovery in structured
data, a well covered topic in literature, consists in searching within
unstructured sources (typically, text) in order to find connections
among entities. These can be a whole dictionary, or a specific
collection of named items. In many cases machine learning and/or
text mining techniques are used for this goal. These approaches
might be unfeasible in computationally challenging problems, such
as processing massive data streams. A faster approach consists in collecting the cooccurrences of any
two words (entities) in order to create a graph of relations - a
cooccurrence graph. Indeed each cooccurrence highlights some grade
of semantic correlation between the words because it is more common
to have related words close each other than having them in the
opposite sides of the text. Some authors have used sliding windows for such problem: they
count all the occurrences within a sliding windows running over the
whole text. In this paper we generalise such technique, coming up
to a Weighted-Distance Sliding Window, where each occurrence of
two named items within the window is accounted with a weight
depending on the distance between items: a closer distance implies
a stronger evidence of a relationship. We develop an experiment in
order to support this intuition, by applying this technique to a data
set consisting in the text of the Bible, split into verses.
Mining Big Data in Telecommunications Industry: Challenges, Techniques, and Revenue Opportunity
Mining big data represents a big challenge nowadays. Many types of research are concerned with mining massive amounts of data and big data streams. Mining big data faces a lot of challenges including scalability, speed, heterogeneity, accuracy, provenance and privacy. In telecommunication industry, mining big data is like a mining for gold; it represents a big opportunity and maximizing the revenue streams in this industry. This paper discusses the characteristics of big data (volume, variety, velocity and veracity), data mining techniques and tools for handling very large data sets, mining big data in telecommunication and the benefits and opportunities gained from them.
Semantic Support for Hypothesis-Based Research from Smart Environment Monitoring and Analysis Technologies
Improvements in the data fusion and data analysis phase of research are imperative due to the exponential growth of sensed data. Currently, there are developments in the Semantic Sensor Web community to explore efficient methods for reuse, correlation and integration of web-based data sets and live data streams. This paper describes the integration of remotely sensed data with web-available static data for use in observational hypothesis testing and the analysis phase of research. The Semantic Reef system combines semantic technologies (e.g., well-defined ontologies and logic systems) with scientific workflows to enable hypothesis-based research. A framework is presented for how the data fusion concepts from the Semantic Reef architecture map to the Smart Environment Monitoring and Analysis Technologies (SEMAT) intelligent sensor network initiative. The data collected via SEMAT and the inferred knowledge from the Semantic Reef system are ingested to the Tropical Data Hub for data discovery, reuse, curation and publication.
Unsupervised Outlier Detection in Streaming Data Using Weighted Clustering
Outlier detection in streaming data is very challenging because streaming data cannot be scanned multiple times and also new concepts may keep evolving. Irrelevant attributes can be termed as noisy attributes and such attributes further magnify the challenge of working with data streams. In this paper, we propose an unsupervised outlier detection scheme for streaming data. This scheme is based on clustering as clustering is an unsupervised data mining task and it does not require labeled data, both density based and partitioning clustering are combined for outlier detection. In this scheme partitioning clustering is also used to assign weights to attributes depending upon their respective relevance and weights are adaptive. Weighted attributes are helpful to reduce or remove the effect of noisy attributes. Keeping in view the challenges of streaming data, the proposed scheme is incremental and adaptive to concept evolution. Experimental results on synthetic and real world data sets show that our proposed approach outperforms other existing approach (CORM) in terms of outlier detection rate, false alarm rate, and increasing percentages of outliers.
Content Based Sampling over Transactional Data Streams
This paper investigates the problem of sampling from transactional data streams. We introduce CFISDS as a content based sampling algorithm that works on a landmark window model of data streams and preserve more informed sample in sample space. This algorithm that work based on closed frequent itemset mining tasks, first initiate a concept lattice using initial data, then update lattice structure using an incremental mechanism.Incremental mechanism insert, update and delete nodes in/from concept lattice in batch manner. Presented algorithm extracts the final samples on demand of user. Experimental results show the accuracy of CFISDS on synthetic and real datasets, despite on CFISDS algorithm is not faster than exist sampling algorithms such as Z and DSS.
A New History Based Method to Handle the Recurring Concept Shifts in Data Streams
Recent developments in storage technology and
networking architectures have made it possible for broad areas of applications to rely on data streams for quick response and accurate
decision making. Data streams are generated from events of real world so existence of associations, which are among the occurrence of these events in real world, among concepts of data streams is
logical. Extraction of these hidden associations can be useful for prediction of subsequent concepts in concept shifting data streams. In this paper we present a new method for learning association among
concepts of data stream and prediction of what the next concept will be. Knowing the next concept, an informed update of data model will be possible. The results of conducted experiments show that the proposed method is proper for classification of concept shifting data
Optimal Convolutive Filters for Real-Time Detection and Arrival Time Estimation of Transient Signals
Linear convolutive filters are fast in calculation and in application, and thus, often used for real-time processing of continuous data streams. In the case of transient signals, a filter has not only to detect the presence of a specific waveform, but to estimate its arrival time as well. In this study, a measure is presented which indicates the performance of detectors in achieving both of these tasks simultaneously. Furthermore, a new sub-class of linear filters within the class of filters which minimize the quadratic response is proposed. The proposed filters are more flexible than the existing ones, like the adaptive matched filter or the minimum power distortionless response beamformer, and prove to be superior with respect to that measure in certain settings. Simulations of a real-time scenario confirm the advantage of these filters as well as the usefulness of the performance measure.
An Efficient Approach to Mining Frequent Itemsets on Data Streams
The increasing importance of data stream arising in a
wide range of advanced applications has led to the extensive study of
mining frequent patterns. Mining data streams poses many new
challenges amongst which are the one-scan nature, the unbounded
memory requirement and the high arrival rate of data streams. In this
paper, we propose a new approach for mining itemsets on data
stream. Our approach SFIDS has been developed based on FIDS
algorithm. The main attempts were to keep some advantages of the
previous approach and resolve some of its drawbacks, and
consequently to improve run time and memory consumption. Our
approach has the following advantages: using a data structure similar
to lattice for keeping frequent itemsets, separating regions from each
other with deleting common nodes that results in a decrease in search
space, memory consumption and run time; and Finally, considering
CPU constraint, with increasing arrival rate of data that result in
overloading system, SFIDS automatically detect this situation and
discard some of unprocessing data. We guarantee that error of results
is bounded to user pre-specified threshold, based on a probability
technique. Final results show that SFIDS algorithm could attain
about 50% run time improvement than FIDS approach.
A Cumulative Learning Approach to Data Mining Employing Censored Production Rules (CPRs)
Knowledge is indispensable but voluminous knowledge becomes a bottleneck for efficient processing. A great challenge for data mining activity is the generation of large number of potential rules as a result of mining process. In fact sometimes result size is comparable to the original data. Traditional data mining pruning activities such as support do not sufficiently reduce the huge rule space. Moreover, many practical applications are characterized by continual change of data and knowledge, thereby making knowledge voluminous with each change. The most predominant representation of the discovered knowledge is the standard Production Rules (PRs) in the form If P Then D. Michalski & Winston proposed Censored Production Rules (CPRs), as an extension of production rules, that exhibit variable precision and supports an efficient mechanism for handling exceptions. A CPR is an augmented production rule of the form: If P Then D Unless C, where C (Censor) is an exception to the rule. Such rules are employed in situations in which the conditional statement 'If P Then D' holds frequently and the assertion C holds rarely. By using a rule of this type we are free to ignore the exception conditions, when the resources needed to establish its presence, are tight or there is simply no information available as to whether it holds or not. Thus the 'If P Then D' part of the CPR expresses important information while the Unless C part acts only as a switch changes the polarity of D to ~D. In this paper a scheme based on Dempster-Shafer Theory (DST) interpretation of a CPR is suggested for discovering CPRs from the discovered flat PRs. The discovery of CPRs from flat rules would result in considerable reduction of the already discovered rules. The proposed scheme incrementally incorporates new knowledge and also reduces the size of knowledge base considerably with each episode. Examples are given to demonstrate the behaviour of the proposed scheme. The suggested cumulative learning scheme would be useful in mining data streams.
SUPAR: System for User-Centric Profiling of Association Rules in Streaming Data
With a surge of stream processing applications novel
techniques are required for generation and analysis of association
rules in streams. The traditional rule mining solutions cannot handle
streams because they generally require multiple passes over the data
and do not guarantee the results in a predictable, small time. Though
researchers have been proposing algorithms for generation of rules
from streams, there has not been much focus on their analysis.
We propose Association rule profiling, a user centric process for
analyzing association rules and attaching suitable profiles to them
depending on their changing frequency behavior over a previous
snapshot of time in a data stream.
Association rule profiles provide insights into the changing nature
of associations and can be used to characterize the associations. We
discuss importance of characteristics such as predictability of
linkages present in the data and propose metric to quantify it. We
also show how association rule profiles can aid in generation of user
specific, more understandable and actionable rules.
The framework is implemented as SUPAR: System for Usercentric
Profiling of Association Rules in streaming data. The
proposed system offers following capabilities:
i) Continuous monitoring of frequency of streaming item-sets
and detection of significant changes therein for association rule
ii) Computation of metrics for quantifying predictability of
associations present in the data.
iii) User-centric control of the characterization process: user
can control the framework through a) constraint specification and b)
non-interesting rule elimination.
Exponentially Weighted Simultaneous Estimation of Several Quantiles
In this paper we propose new method for
simultaneous generating multiple quantiles corresponding to given
probability levels from data streams and massive data sets. This
method provides a basis for development of single-pass low-storage
quantile estimation algorithms, which differ in complexity, storage
requirement and accuracy. We demonstrate that such algorithms may
perform well even for heavy-tailed data.
Cumulative Learning based on Dynamic Clustering of Hierarchical Production Rules(HPRs)
An important structuring mechanism for knowledge bases is building clusters based on the content of their knowledge objects. The objects are clustered based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. Clustering can also facilitate taxonomy formation, that is, the organization of observations into a hierarchy of classes that group similar events together. Hierarchical representation allows us to easily manage the complexity of knowledge, to view the knowledge at different levels of details, and to focus our attention on the interesting aspects only. One of such efficient and easy to understand systems is Hierarchical Production rule (HPRs) system. A HPR, a standard production rule augmented with generality and specificity information, is of the following form Decision If < condition> Generality Specificity . HPRs systems are capable of handling taxonomical structures inherent in the knowledge about the real world. In this paper, a set of related HPRs is called a cluster and is represented by a HPR-tree. This paper discusses an algorithm based on cumulative learning scenario for dynamic structuring of clusters. The proposed scheme incrementally incorporates new knowledge into the set of clusters from the previous episodes and also maintains summary of clusters as Synopsis to be used in the future episodes. Examples are given to demonstrate the behaviour of the proposed scheme. The suggested incremental structuring of clusters would be useful in mining data streams.