KDIR 2010 Abstracts


Full Papers
Paper Nr: 8
Title:

AUTOMATIC SPATIAL PLAUSIBILITY CHECKS FOR MEDICAL OBJECT RECOGNITION RESULTS USING A SPATIO-ANATOMICAL ONTOLOGY

Authors:

Manuel Möller and Patrick Ernst

Abstract: We present an approach to use medical expert knowledge represented in formal ontologies to check the results of automatic medical object recognition algorithms for spatial plausibility. Our system is based on the comprehensive Foundation Model of Anatomy ontology which we extend with spatial relations between a number of anatomical entities. These relations are learned inductively from an annotated corpus of 3D volume data sets. The induction process is split into two parts: First, we generate a quantitative anatomical atlas using fuzzy sets to represent inherent imprecision. From this atlas we abstract onto a purely symbolic level to generate a generic qualitative model of the spatial relations in human anatomy. In our evaluation we describe how this model can be used to check the results of a state-of-the-art medical object recognition system for 3D CT volume data sets for spatial plausibility. Our results show that the combination of medical domain knowledge in formal ontologies and sub-symbolic object recognition yields improved overall recognition precision.

Paper Nr: 15
Title:

ASSESSING PROGRESSIVE FILTERING TO PERFORM HIERARCHICAL TEXT CATEGORIZATION IN PRESENCE OF INPUT IMBALANCE

Authors:

Andrea Addis

Abstract: The greater the amount of available data (e.g., in digital libraries), the greater the need for high-performance text categorization algorithms. So far, work on text categorization has mostly focused on “flat” approaches, i.e., algorithms that operate on non-hierarchical classification schemes. Hierarchical approaches are expected to perform better in the presence of a subsumption ordering among categories. In fact, according to the “divide et impera” strategy, they partition the problem into smaller subproblems, each expected to be simpler to solve. In this paper, we illustrate and discuss the results obtained by assessing the “Progressive Filtering” (PF) technique for text categorization. Experiments on the Reuters Corpus (RCV1-v2) and on DMOZ datasets focus on the ability of PF to deal with input imbalance. In particular, the assessment consists of: (i) comparing the results to those obtained with the corresponding flat approach; (ii) calculating the improvement in performance while augmenting the pipeline depth; and (iii) measuring the performance in terms of generalization- / specialization- / misclassification-error and unknown-ratio. Experimental results show that, for the adopted datasets, PF is able to counteract great imbalances between negative and positive examples.

Paper Nr: 19
Title:

EVOLUTIVE CONTENT-BASED SEARCH SYSTEM - Semantic Search System based on Case-based-Reasoning and Ontology Enrichment

Authors:

Manel Elloumi-Chaabene, Nesrine Ben Mustapha and Hajer Baazaoui-Zghal

Abstract: The exponential growth of the data available on the Web requires more efficient search tools to find relevant information. In the context of the Semantic Web, ontologies improve the exploitation of Web resources by adding consensual knowledge. However, the automation of ontology building is a hard problem. The exploitation of users’ search feedback can help address that problem. In order to apply this solution, we present a semantic search system based on case-based reasoning (CBR) and ontologies that automatically enriches the ontologies by using previous queries. In this paper we describe the contribution of CBR and ontologies to Semantic Web search. Some experiments on the system are presented, and the obtained results show an improvement in the precision of the Web search and ontology enrichment.

Paper Nr: 20
Title:

GENERATING LITERATURE-BASED KNOWLEDGE DISCOVERIES IN LIFE SCIENCES USING RELATIONSHIP ASSOCIATIONS

Authors:

Steven B. Kraines, Weisen Guo and Daisuke Hoshiyama

Abstract: The life sciences have been a pioneering discipline for the field of knowledge discovery, since the literature-based discoveries by Swanson three decades ago. Existing literature-based knowledge discovery techniques generally try to discover hitherto unknown associations of domain concepts based on associations that can be established from the literature. However, scientific facts are more often expressed as specific relationships between concepts and/or entities that have been established through scientific research. A pair of relationships that predicate the specific way in which one concept relates to another can be associated if one of the concepts from each relationship can be determined to be semantically equivalent; we call this a “relationship association”. Then, by making the same assumption of the transitivity of association used by Swanson and others, we can generate a hypothetical relationship association by combining two relationship associations that have been extracted from a knowledge base. Here we describe an algorithm for generating potential knowledge discoveries in the form of new relationship associations that are implied but not actually stated, and we test the algorithm against a corpus of almost 5000 relationship associations that we have extracted in previous work from 392 semantic graphs representing research articles from MEDLINE.
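The generation step described above — combining two known relationships that share a concept to propose an unseen one — can be sketched as follows. This is a toy reading of the Swanson-style transitivity assumption; the triple format and the names used are illustrative, not the paper's representation.

```python
def hypothesize(relationships):
    """Given known relationships as (subject, predicate, object) triples,
    propose hypothetical associations (a, r1, r2, c) whenever (a, r1, b)
    and (b, r2, c) are known but the pair (a, c) is not. A minimal
    sketch of transitivity-based literature discovery."""
    known_pairs = {(a, c) for a, _, c in relationships}
    hypotheses = set()
    for a, r1, b in relationships:
        for b2, r2, c in relationships:
            if b == b2 and a != c and (a, c) not in known_pairs:
                hypotheses.add((a, r1, r2, c))
    return hypotheses
```

With two chained relationships, the sketch proposes exactly the bridging association between their outer concepts.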

Paper Nr: 23
Title:

SIMPLE AND EFFICIENT PROJECTIVE CLUSTERING

Authors:

Clark F. Olson and Henry J. Lyons

Abstract: We describe a new Monte Carlo algorithm for projective clustering that is both simple and efficient. Like previous Monte Carlo algorithms, we perform trials that sample a small subset of the data points to determine the dimensions in which the points are sufficiently close to form a cluster and then search the rest of the data for data points that are part of the cluster. However, our algorithm differs from previous algorithms in the method by which the dimensions of the cluster are determined and the method for determining the points in the cluster. This allows us to use smaller subsets of the data to determine the cluster dimensions and achieve improved efficiency over previous algorithms. The complexity of our algorithm is O(nd^(1 + log a / log b)), where n is the number of data points, d is the number of dimensions in the space, and a and b are parameters that specify which clusters should be found. To our knowledge, this is the lowest published complexity for an algorithm that is able to place high bounds on the probability of success. We present experiments that show that our algorithm outperforms previous algorithms on real and synthetic data.
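The trial structure the abstract describes — sample a few points, keep the dimensions in which they are mutually close, then gather congruent points — can be sketched as below. All parameter names and values (`width`, `min_points`, the axis-aligned closeness test) are illustrative assumptions; the paper's actual dimension-selection and membership rules differ.

```python
import numpy as np

def find_projective_cluster(X, trials=50, sample_size=4, width=0.5,
                            min_points=20, rng=None):
    """One hypothetical Monte Carlo search for a projective cluster:
    repeatedly sample a small subset, keep the dimensions where the
    sample spans at most `width`, then collect every point close to the
    sample centre in those dimensions. Returns (dims, member_indices)
    for the candidate with the most dimensions, or None."""
    rng = np.random.default_rng(rng)
    best = None
    for _ in range(trials):
        idx = rng.choice(len(X), size=sample_size, replace=False)
        S = X[idx]
        # dimensions in which the sampled points are sufficiently close
        dims = np.where(S.max(axis=0) - S.min(axis=0) <= width)[0]
        if len(dims) == 0:
            continue
        centre = S[:, dims].mean(axis=0)
        members = np.where(
            np.all(np.abs(X[:, dims] - centre) <= width, axis=1))[0]
        if len(members) >= min_points and (
                best is None or len(dims) > len(best[0])):
            best = (dims, members)
    return best
```

The more trials, the higher the probability that some sample falls entirely inside a true cluster, which is where the algorithm's probabilistic success bounds come from.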

Paper Nr: 32
Title:

DISCOVERING CRITICAL SITUATIONS IN ONLINE SOCIAL NETWORKS - A Neuro Fuzzy Approach to Alert Marketing Managers

Authors:

Carolin Kaiser

Abstract: More and more people are exchanging their opinions in online social networks and influencing each other. It is crucial for companies to observe opinion formation concerning their products. Thus, risks can be recognized early on and counteractive measures can be initiated by marketing managers. A neuro fuzzy approach is presented which allows the detection of critical situations in the process of opinion formation and the alerting of marketing managers. Rules for identifying critical situations are learned on the basis of the opinions of the network members, the influence of the opinion leaders and the structure of the network. The opinions and characteristics of the network are identified by text mining and social network analysis. The approach is illustrated by an exemplary application.

Paper Nr: 33
Title:

INTERPRETING EXTREME LEARNING MACHINE AS AN APPROXIMATION TO AN INFINITE NEURAL NETWORK

Authors:

Elina Parviainen and Jaakko Riihimäki

Abstract: Extreme Learning Machine (ELM) is a neural network architecture in which hidden layer weights are randomly chosen and output layer weights determined analytically. We interpret ELM as an approximation to a network with an infinite number of hidden units. The operation of the infinite network is captured by the neural network kernel (NNK). We compare ELM and NNK both as part of a kernel method and in a neural network context. Insights gained from this analysis lead us to strongly recommend model selection also on the variance of ELM hidden layer weights, and not only on the number of hidden units, as is usually done with ELM. We also discuss some properties of ELM, which may have been too strongly interpreted in previous works.
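The ELM recipe the abstract refers to is short enough to sketch: random hidden weights, then output weights by least squares. The `weight_std` parameter exposes the hidden-weight variance the authors recommend model-selecting; the tanh activation and all defaults here are illustrative choices.

```python
import numpy as np

def elm_train(X, y, n_hidden, weight_std=1.0, rng=None):
    """Plain ELM: draw hidden-layer weights at random, then solve the
    output layer analytically via least squares. `weight_std` controls
    the hidden-weight variance."""
    rng = np.random.default_rng(rng)
    W = rng.normal(0.0, weight_std, size=(X.shape[1], n_hidden))
    b = rng.normal(0.0, weight_std, size=n_hidden)
    H = np.tanh(X @ W + b)                        # hidden activations
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)  # analytic output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta
```

Sweeping `weight_std` over a grid and validating, rather than fixing it at 1.0, is exactly the extra model-selection step the abstract argues for.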

Paper Nr: 61
Title:

PROXIMITY-BASED GRAPH EMBEDDINGS FOR MULTI-LABEL CLASSIFICATION

Authors:

Tingting Mu and Sophia Ananiadou

Abstract: In many real applications of text mining, information retrieval and natural language processing, large-scale feature sets are frequently used, which often make the employed machine learning algorithms intractable, a manifestation of the well-known “curse of dimensionality”. Aiming at not only removing the redundant information from the original features but also improving their discriminating ability, we present a novel approach to the supervised generation of low-dimensional, proximity-based graph embeddings to facilitate multi-label classification. The optimal embeddings are computed from a supervised adjacency graph, called a multi-label graph, which simultaneously preserves proximity structures between samples constructed from feature and multi-label class information. We propose different ways to obtain this multi-label graph, by working either in a binary label space or in a projected real label space. To reduce the training cost in the dimensionality reduction procedure caused by large-scale features, a smaller set of relation features between each sample and a set of representative prototypes is employed. The effectiveness of our proposed method is demonstrated with two document collections for text categorization based on the “bag of words” model.

Paper Nr: 75
Title:

VISUALLY SUMMARIZING THE EVOLUTION OF DOCUMENTS UNDER A SOCIAL TAG

Authors:

André Gohr

Abstract: Tags are intensively used in social platforms to annotate resources: Tagging is a social phenomenon, because users do not only annotate to organize their resources but also to associate semantics with resources contributed by third parties. This often leads to semantic ambiguities: Popular tags are associated with very disparate meanings, even to the extent that some tags (e.g. “beautiful” or “toread”) are irrelevant to the semantics of the resources they annotate. We propose a method that learns a topic model for documents under a tag and visualizes the different meanings associated with the tag. Our approach deals with the following problems. First, tag miscellany is a temporal phenomenon: tags acquire multiple semantics gradually, as users apply them to disparate documents. Hence, our method must capture and visualize the evolution of the topics in a stream of documents. Second, the meanings associated with a tag must be presented in a human-understandable way; this concerns both the choice of words and the visualization of all meanings. Our method uses AdaptivePLSA, a variation of Probabilistic Latent Semantic Analysis for streams, to learn and adapt topics on a stream of documents annotated with a specific tag. We propose a visualization technique called Topic Table to visualize document prototypes derived from topics and their evolution over time. We show in a case study how our method captures the evolution of tags selected as frequent and ambiguous, and visualizes their semantics in a comprehensible way. Additionally, we show its effectiveness by adding alien resources under a tag: our approach indeed visualizes hints of the added documents.

Paper Nr: 92
Title:

FEATURE SELECTION FOR THE INSTRUMENTED TIMED UP AND GO IN PARKINSON’S DISEASE

Authors:

L. Palmerini, L. Rocchi and S. Mellone

Abstract: The Timed Up and Go (TUG) is a widely used clinical test to assess mobility and fall risk in Parkinson’s disease (PD). The traditional outcome of this test is its duration. Since this single measure cannot provide insight into subtle differences in test performance, we considered an instrumented TUG (iTUG). The aim was to find, by means of feature selection, the best set of quantitative measures that would allow an objective evaluation of gait function in PD. We instrumented the TUG using a triaxial accelerometer. Twenty early-mild PD and twenty age-matched control subjects performed normal and dual task TUG trials. Several temporal, coordination and smoothness measures were extracted from the acceleration signals; a wrapper feature selection was implemented for different classifiers with an exhaustive search over subsets of 1 to 3 features. A leave-one-out cross validation (LOOCV) was implemented both for the feature selection and for the evaluation of the classifier, resulting in a nested LOOCV. The selected features yield good accuracy (a 7.5% misclassification rate) in the classification of PD. Interestingly, the traditional TUG duration was not selected in any of the best subsets.
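The inner loop of the evaluation above — exhaustive wrapper search over feature subsets of size 1 to 3, scored by leave-one-out accuracy — can be sketched as follows. The nearest-class-mean classifier is a stand-in assumption (the paper evaluates several classifiers); a full nested LOOCV would re-run `wrapper_select` inside an outer leave-one-out loop.

```python
import numpy as np
from itertools import combinations

def loo_accuracy(X, y, classify):
    """Leave-one-out accuracy of classify(train_X, train_y, test_x)."""
    hits = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        hits += classify(X[mask], y[mask], X[i]) == y[i]
    return hits / len(X)

def nearest_mean(train_X, train_y, x):
    # minimal stand-in classifier: assign to the nearest class mean
    classes = np.unique(train_y)
    means = np.array([train_X[train_y == c].mean(axis=0) for c in classes])
    return classes[np.argmin(np.linalg.norm(means - x, axis=1))]

def wrapper_select(X, y, max_size=3):
    """Exhaustive wrapper feature selection over subsets of size
    1..max_size, scored by inner leave-one-out accuracy."""
    best_subset, best_score = None, -1.0
    for k in range(1, max_size + 1):
        for subset in combinations(range(X.shape[1]), k):
            score = loo_accuracy(X[:, subset], y, nearest_mean)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score
```

Exhaustive search is feasible here only because subsets are capped at 3 features; the number of candidate subsets grows combinatorially otherwise.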

Paper Nr: 95
Title:

CrossSense - Sensemaking in a Folksonomy with Cross-modal Clustering over Content and User Activities

Authors:

Hans-Henning Gabriel

Abstract: Folksonomies are of increasing importance today: many different platforms have emerged and millions of people use them. We consider the case of a user who enters such a social platform and wants to get an overview of a particular domain. The folksonomy provides abundant information for that task in the form of documents, tags on them, and users who contribute documents and tags. We propose a process that identifies a small number of thematically “interesting objects” with respect to subject domains. Our novel algorithm CrossSense builds clusters composed of objects of different types upon a data tensor. It then selects pivot objects that are characteristic of one cluster and are associated with many objects of different types from the clusters. Then, CrossSense collects all the folksonomy content that is associated with a pivot object, i.e. the object’s world. We rank pivot objects and present the top ones to the user. We have experimented with Bibsonomy data against a baseline that selects the most popular users, documents and tags, accompanied by the objects most frequently co-occurring with them. Our experiments show that our pivot objects exhibit more homogeneity and constitute a smaller set of entities to be inspected by the user.

Paper Nr: 107
Title:

DYNAMIC QUERY EXPANSION BASED ON USER’S REAL TIME IMPLICIT FEEDBACK

Authors:

Sanasam Ranbir Singh

Abstract: The majority of queries submitted to search engines are short and under-specified. Query expansion is a commonly used technique to address this issue. However, existing query expansion frameworks have an inherent problem of poor coherence between expansion terms and the user’s search goal. The user’s search goal, even for the same query, may differ across instances. This often leads to poor retrieval performance. In many instances, a user’s current search is influenced by his/her recent searches. In this paper, we study a framework which explores the user’s implicit feedback provided at the time of search to determine the user’s search context. We then incorporate the proposed framework into query expansion to identify relevant expansion terms. Extensive experiments show that the proposed framework can capture the dynamics of the user’s search and adapt query expansion accordingly.

Paper Nr: 129
Title:

UNWANTED BEHAVIOUR DETECTION AND CLASSIFICATION IN NETWORK TRAFFIC

Authors:

İsmail Melih Önem

Abstract: An Intrusion Detection System (IDS) classifies activities with unwanted intent and can log or prevent activities that are marked as intrusions. Intrusions occur when malicious activity and unwanted behaviour gain access to or affect the usability of a computer resource. In recent years, anomaly detection has attracted the attention of many researchers as a way to overcome the weakness of signature-based IDSs in discovering novel attacks, and KDDCUP’99 is the most widely used data set for the evaluation of such systems. The difficulty lies in discovering unwanted behaviour in network traffic by means of machine learning methods and processes. The goal of this research is to use SVM models with different kernels and different kernel parameters to classify unwanted behaviour in network traffic with scalable performance. The SVM model enables a flexible, flow-based method for detecting unwanted behaviour, illustrated here in the context of an incident, and can inform the design and deployment of improved techniques for security scanning. Scalability and performance are major considerations, and the results also target minimizing false positives and false negatives. The classification developed in this paper improves SVM computational efficiency in detecting intrusions in each category, and experimental results are presented for an implementation of the enhanced model tested against real intrusions.

Short Papers
Paper Nr: 2
Title:

WEB INFORMATION GATHERING TASKS - A Framework and Research Agenda

Authors:

Anwar Alhenshiri and Carolyn Watters

Abstract: This paper provides an in-depth analysis of Web information gathering tasks. Research has focused on categorizing Web tasks by creating a high-level framework of user goals and activities on the Web. Yet, there has been very limited emphasis on improving the effectiveness of Web search for information gathering under the concept of a complete task. This paper provides a framework in which the subtasks underlying the overall task of Web information gathering are considered. Moreover, the paper offers research recommendations for techniques concerning collecting and gathering information on the Web.

Paper Nr: 4
Title:

STATISTICAL METHODS FOR THE EVALUATION OF INDEXING PHRASES

Authors:

Antoine Doucet and Helena Ahonen-Myka

Abstract: In this paper, we review statistical techniques for the direct evaluation of descriptive phrases and introduce a new technique based on mutual information. In the experiments, we apply this technique to different types of frequent sequences, thereby finding mathematical justification for former empirical practice.
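One standard mutual-information-style score for a candidate phrase is pointwise mutual information (PMI) over co-occurrence counts, sketched below. The paper's exact measure may differ; this is the textbook formulation only.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information of a word pair (x, y):
    log of how much more often x and y co-occur than independence
    would predict. High PMI suggests a genuine phrase rather than a
    chance juxtaposition."""
    p_xy = count_xy / total
    p_x, p_y = count_x / total, count_y / total
    return math.log(p_xy / (p_x * p_y))
```

A pair that co-occurs exactly as often as independence predicts scores 0; positive scores indicate association.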

Paper Nr: 5
Title:

AN EFFICIENT PSO-BASED CLUSTERING ALGORITHM

Authors:

Chun-Wei Tsai and Ko-Wei Huang

Abstract: Recently, particle swarm optimization (PSO) has become one of the most popular approaches to clustering problems because it can provide higher quality results than deterministic local search methods. The problem of PSO in solving clustering problems, however, is that it is much slower than deterministic local search methods. This paper presents a novel method to speed up its performance for the partitional clustering problem, based on the idea of eliminating computations that are essentially redundant during its convergence process. In addition, a multistart strategy is used to improve the quality of the end result. To evaluate the performance of the proposed method, we compare it with several state-of-the-art methods on data and image clustering problems. Our simulation results indicate that the proposed method can reduce the computation time of the k-means and PSO-based algorithms by about 60% to 90% while finding similar or even better results.

Paper Nr: 14
Title:

MACHINE LEARNING AND LINK ANALYSIS FOR WEB CONTENT MINING

Authors:

Moreno Carullo and Elisabetta Binaghi

Abstract: In this work we define a hybrid Web Content Mining strategy aimed at recognizing within Web pages the main entity, intended as the short text that refers directly to the main topic of a given page. The salient aspect of the strategy is the use of a novel supervised Machine Learning model able to represent in a unified framework the integrated use of visual page-layout features, textual features and hyperlink descriptions. The proposed approach has been evaluated with promising results.

Paper Nr: 16
Title:

ANSWER GRAPH CONSTRUCTION FOR KEYWORD SEARCH ON GRAPH STRUCTURED (RDF) DATA

Authors:

K. Parthasarathy

Abstract: Keyword search is an easy way to allow inexperienced users to query an information system. It does not require knowledge of a specific query language or of the underlying schema. Recently, answering keyword queries on graph structured data has emerged as an important research topic. Many efforts focus on queries over RDF (Resource Description Framework) graphs, as RDF has emerged as a viable data model for representing/integrating semistructured, distributed and interconnected data. In this paper, we present a novel algorithm for constructing answer graphs using a pruned exploration strategy. We form component structures comprising closely related class and relationship nodes for the keywords and join the identified component structures using appropriate hook nodes. The Class/SubClass relationships available in the RDF schema are also utilized for the answer graph construction. The paper illustrates the working of the algorithm using AIFB institute data.

Paper Nr: 28
Title:

COLLECTIVE BEHAVIOUR IN INTERNET - Tendency Analysis of the Frequency of User Web Queries

Authors:

Joan Codina-Filba and David F. Nettleton

Abstract: In this paper we propose a classification of different observable trends over time for user web queries. The focus is on the identification of general collective trends, based on search query keywords, of the user community on the Internet and how they behave over a given time period. We give some representative examples of real search queries and their tendencies. From these examples we define a set of descriptive features which can be used as inputs for data modelling. Then we use a selection of unsupervised (clustering) and supervised modelling techniques to classify the trends. The results show that it is relatively easy to classify the basic hypothetical trends we have defined, and we identify which of the chosen learning techniques are best able to model the data. However, the presence of more complex, noisy or mixed trends makes the classification more difficult.

Paper Nr: 38
Title:

TACTICAL ANALYSIS MODELING THROUGH DATA MINING - Pattern Discovery in Racket Sports

Authors:

Antonio Terroba Acha

Abstract: We explore pattern discovery within the game of tennis. To this end, we formalize events in a match, and define similarities for events and event sequences. We then proceed by looking at unbalancing events and their immediate prequel (using pattern masks) and sequel (using nondeterministic finite automata). Structured in this way, the data can be effectively mined, and a similar approach might also be applied to more general areas. We show that data mining is able to find interesting patterns in real-world data from tennis matches.

Paper Nr: 41
Title:

EXPLOITING N-GRAM IMPORTANCE AND WIKIPEDIA BASED ADDITIONAL KNOWLEDGE FOR IMPROVEMENTS IN GAAC BASED DOCUMENT CLUSTERING

Authors:

Niraj Kumar and Venkata Vinay Babu Vemula

Abstract: This paper provides a solution to the question: “How can we use Wikipedia based concepts in document clustering with less human involvement, accompanied by effective improvements in the result?” In the devised system, we propose a method to exploit the importance of N-grams in a document and use Wikipedia based additional knowledge for GAAC based document clustering. The importance of an N-gram in a document depends on many features including, but not limited to: frequency, position of occurrence within a sentence, and the position within the document of the sentence in which it occurs. First, we introduce a new similarity measure, which takes the weighted N-gram importance into account when calculating similarity during document clustering. As a result, the chances of topical similarity in clustering are improved. Second, we use Wikipedia as an additional knowledge base both to remove noisy entries from the extracted N-grams and to reduce the information gap between N-grams that are conceptually related but do not match owing to differences in writing scheme or strategy. Our experimental results on a publicly available text dataset clearly show that our devised system achieves a significant improvement in performance over bag-of-words based state-of-the-art systems in this area.
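A minimal sketch of a weighted N-gram similarity in the spirit described above: N-grams are weighted by frequency and (as one hypothetical cue among the several the paper combines) by how early they appear, and documents are compared by cosine similarity over these weighted vectors. The specific weighting formula here is an illustrative assumption, not the paper's measure.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def weighted_ngram_vector(tokens, max_n=3):
    """Weighted N-gram profile of a document: each 1..max_n-gram
    accumulates weight, discounted by its relative position so that
    earlier occurrences count more (hypothetical weighting)."""
    vec = Counter()
    for n in range(1, max_n + 1):
        grams = ngrams(tokens, n)
        for i, g in enumerate(grams):
            vec[g] += 1.0 / (1 + i / max(1, len(grams)))
    return vec

def cosine(u, v):
    dot = sum(u[g] * v[g] for g in u if g in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

In a GAAC setting this similarity would replace the plain bag-of-words cosine when merging clusters.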

Paper Nr: 42
Title:

INFORMATION RETRIEVAL FROM HISTORICAL DOCUMENT IMAGE BASE

Authors:

Khurram Khurshid and Imran Siddiqi

Abstract: This communication presents an effective method for information retrieval from a historical document image base. The proposed approach is based on word and character extraction from the text and the attribution of feature vectors to each of the character images. Words are matched by comparing their characters through multi-stage Dynamic Time Warping (DTW) on the extracted feature set. The approach exhibits extremely promising results, reaching more than a 96% retrieval/recognition rate.
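The core matching primitive, DTW over sequences of feature vectors, can be sketched with the textbook recurrence below. The paper applies a multi-stage variant to character feature sequences; this single-stage version is only illustrative.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two feature
    sequences (each element is a feature vector). D[i][j] holds the
    cheapest alignment cost of the first i elements of a with the
    first j elements of b."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(np.asarray(a[i - 1]) - np.asarray(b[j - 1]))
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Because DTW tolerates insertions and stretching, a word image with a slightly distorted or elongated character can still match its clean counterpart at low cost.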

Paper Nr: 43
Title:

NUMBER THEORY-BASED INDUCTION OF DETERMINISTIC CONTEXT-FREE L-SYSTEM GRAMMAR

Authors:

Ryohei Nakano and Naoya Yamada

Abstract: This paper addresses grammatical induction of deterministic context-free L-systems (D0L-systems). Considering the parallel nature of L-system production and the deterministic context-free nature of D0L-systems, we take a number theory-based approach. Here the D0L-system grammar is limited to one or two production rules. Basic equations for the methods are derived and used to narrow down the parameter value ranges. Our experiments using plant models showed that the proposed methods induce the original production rules very efficiently.

Paper Nr: 44
Title:

A TRADEOFF BALANCING ALGORITHM FOR HIDING SENSITIVE FREQUENT ITEMSETS

Authors:

Harun Gökçe and Osman Abul

Abstract: The sensitive frequent itemset hiding problem is typically solved by applying a sanitization process which transforms the source database into a release version. The main challenge in the process is to preserve the database utility while ensuring no sensitive knowledge is disclosed, directly or indirectly. Several algorithmic solutions based on different approaches have been proposed to solve the problem. We observe that the available algorithms behave like seesaws when both effectiveness and efficiency are considered. However, most practical domains demand solutions with satisfactory effectiveness/efficiency performance, i.e., solutions balancing the tradeoff between the two. Motivated by this observation, in this paper we present a simple and practical frequent itemset hiding algorithm targeting balanced solutions. Experimental evaluation on two datasets shows that the algorithm indeed achieves a good balance between the two performance criteria.
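The sanitization process described above can be illustrated with a deliberately naive baseline: while the sensitive itemset is still frequent, delete one of its items from a supporting transaction. Real algorithms, including a balanced one like the paper's, choose the victim transaction and item so as to minimise utility loss; everything below is a toy assumption.

```python
def hide_itemset(transactions, sensitive, min_support):
    """Naive sanitisation: repeatedly remove one (arbitrary, fixed)
    item of the sensitive itemset from some supporting transaction
    until the itemset's support drops below `min_support`."""
    sensitive = frozenset(sensitive)
    txns = [set(t) for t in transactions]
    victim_item = next(iter(sensitive))  # arbitrary fixed victim item
    while sum(sensitive <= t for t in txns) >= min_support:
        supporter = next(t for t in txns if sensitive <= t)
        supporter.discard(victim_item)
    return [sorted(t) for t in txns]
```

Each deletion lowers the sensitive itemset's support by exactly one, so termination is guaranteed; the utility cost is what smarter victim selection tries to reduce.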

Paper Nr: 45
Title:

AUTHOR ATTRIBUTION EVALUATION WITH NOVEL TOPIC CROSS-VALIDATION

Authors:

Andrew I. Schein and Johnnie F. Caver

Abstract: The practice of using statistical models to predict authorship (so-called author attribution models) is long established. Several recent authorship attribution studies have indicated that topic-specific cues impact author attribution machine learning models. The arrival of new topics should be anticipated rather than ignored in an author attribution evaluation methodology; a model that relies heavily on topic cues will be problematic in deployment settings where novel topics are common. We develop a protocol and test bed for measuring sensitivity to topic cues using a methodology called novel topic cross-validation. Our methodology performs a cross-validation where only topics unseen in training data are used in the test portion. Analysis of the testing framework suggests that corpora with large numbers of topics lead to more powerful hypothesis testing in novel topic evaluation studies. In order to implement the evaluation metric, we developed two subsets of the New York Times Annotated Corpus, including one with 15 authors and 23 topics. We evaluated a maximum entropy classifier in standard and novel topic cross-validation in order to compare the mechanics of the two procedures. Our novel topic evaluation framework supports automatic learning of stylometric cues that are topic neutral, and our test bed is reproducible using document identifiers available from the authors.
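One simple instantiation of the fold construction described above is leave-one-topic-out: every test document's topic is absent from the corresponding training split. The paper's protocol may group topics differently; this sketch only captures the defining constraint.

```python
def novel_topic_folds(topics):
    """Yield (train_indices, test_indices) pairs such that the test
    fold contains only documents whose topic never appears in the
    training fold (leave-one-topic-out)."""
    for t in sorted(set(topics)):
        test = [i for i, x in enumerate(topics) if x == t]
        train = [i for i, x in enumerate(topics) if x != t]
        yield train, test
```

A classifier scored over these folds cannot benefit from topic-specific vocabulary, which is exactly what makes the evaluation sensitive to topic cues.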

Paper Nr: 48
Title:

AN ALGORITHM FOR DECISION RULES AGGREGATION

Authors:

Adam Gudys and Marek Sikora

Abstract: Decision trees and decision rules are usually applied to classification problems in which legibility and interpretability of the obtained data model are as important as good classification ability. Besides trees, rules are the most frequently used knowledge representation applied by knowledge discovery algorithms. Rules generated by traditional algorithms use conjunctions of simple conditions, each dividing the input space by a hyperplane parallel to one of the hyperplanes of the coordinate system. There are problems for which such an approach results in a huge set of rules that poorly models real dependencies in the data, is susceptible to overfitting, and is hard for humans to understand. Generating decision rules containing more complicated conditions may improve the quality and interpretability of a rule set. In this paper we present an algorithm that takes a set of traditional rules and aggregates them in order to obtain a smaller set of more complex rules. As the procedure uses convex hulls, it is called the Convex Hull-Based Iterative Aggregation Algorithm.

Paper Nr: 52
Title:

MULTI-MODAL ANALYSIS OF COMPLEX NETWORK - Point Stimulus Response Depending on its Location in the Network

Authors:

Takeshi Ozeki

Abstract: We report a new method of diagnosing a node in a network via the “Point Stimulus Response”. The point stimulus response corresponds to the impulse response of the network, that is, the temporal variation of the state in the Markov transition starting from a delta-function initial state. We can evaluate the reaction of the system to a point stimulus such as a point failure. In this report we first summarize our mathematical platform for analysing complex network systems, using the adjacency matrix as the transition matrix in the Markov transition approximation. On this basis, we formulate the point stimulus response. The location dependence of the point stimulus response is demonstrated on the Tokyo Metropolitan Railway Network System. As a concrete example, the total number of affected passengers and the time response of recovery from a point failure are discussed depending on the location of the failure in the network. Finding a point with an effective stimulus response is one of the key approaches for knowledge discovery; however, the real indication or meaning of the point stimulus remains at the stage of speculation.
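The formulation described above — a delta-function initial state propagated through a transition matrix derived from the adjacency matrix — can be sketched in a few lines. Row-normalising the adjacency matrix is an assumption made here to obtain a stochastic matrix; the paper's platform may normalise differently.

```python
import numpy as np

def point_stimulus_response(A, node, steps):
    """Impulse response of a network under a Markov-transition
    approximation: start from a delta distribution at `node` and
    propagate it with the row-normalised adjacency matrix.
    Returns the state at each step (rows of the returned array)."""
    A = np.asarray(A, dtype=float)
    P = A / A.sum(axis=1, keepdims=True)   # stochastic transition matrix
    state = np.zeros(len(A))
    state[node] = 1.0                      # delta-function initial state
    history = [state]
    for _ in range(steps):
        state = state @ P
        history.append(state)
    return np.array(history)
```

How quickly the state spreads from (or re-concentrates after) the stimulated node is the kind of location-dependent signature the paper examines for point failures.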

Paper Nr: 54
Title:

A METHOD TO GENERATE A REDUCED TRAINING SET FOR FASTER AND BETTER NEAREST NEIGHBOR CLASSIFICATION

Authors:

P. Viswanath

Abstract: Classification time and space requirements of nearest neighbor based classifiers depend directly on the training set size. There exist several ways to reduce the training set size, and there are also methods, often called bootstrap methods, that generate artificial training sets aimed at achieving better classification accuracy. This paper proposes a method that tries to achieve both objectives: improving performance by generating a bootstrapped training set, and reducing the training set size by eliminating some irrelevant training patterns. The proposed method is faster than similar recent methods and runs in time linear in the training set size. The method first finds a clustering in a class of training patterns using the c-means clustering method to derive the c mean patterns; then, for each pattern, a new pattern is derived by taking a weighted combination of the pattern with its mean. This smooths the boundary between classes in the feature space and hence can act as a regularization step. In addition, a threshold distance is set and all patterns that fall within this distance of a mean pattern are removed from the training set. Since these are mostly interior patterns, their removal does not affect the boundary between the classes. Experimentally, the proposed method is compared against recent relevant methods and is shown to be effective and faster than them. The proposed method is well suited to large data sets like those in data mining.
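The two steps described above — smoothing each pattern toward its cluster mean, then dropping interior patterns near a mean — can be sketched per class as follows. The mixing weight `alpha`, the `threshold`, and the minimal Lloyd's-algorithm stand-in for c-means are all illustrative assumptions.

```python
import numpy as np

def kmeans(X, c, iters=20, rng=None):
    # minimal Lloyd's algorithm, standing in for the c-means step
    rng = np.random.default_rng(rng)
    centres = X[rng.choice(len(X), size=c, replace=False)]
    for _ in range(iters):
        labels = np.argmin(
            np.linalg.norm(X[:, None] - centres[None], axis=2), axis=1)
        for k in range(c):
            if np.any(labels == k):
                centres[k] = X[labels == k].mean(axis=0)
    return centres, labels

def bootstrap_and_reduce(X, c=2, alpha=0.5, threshold=0.3, rng=None):
    """Apply to the patterns of ONE class: (1) replace each pattern by
    a weighted combination with its cluster mean (boundary smoothing /
    regularization), (2) drop interior patterns closer than `threshold`
    to their mean."""
    X = np.asarray(X, dtype=float)
    centres, labels = kmeans(X, c, rng=rng)
    means = centres[labels]
    smoothed = alpha * X + (1 - alpha) * means
    keep = np.linalg.norm(X - means, axis=1) > threshold
    return smoothed[keep]
```

Running this once per class keeps the whole procedure linear in the training set size, which is the efficiency claim made above.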

Paper Nr: 62
Title:

A CLASS SPECIFIC DIMENSIONALITY REDUCTION FRAMEWORK FOR CLASS IMBALANCE PROBLEM: CPC SMOTE

Authors:

T. Maruthi Padmaja

Abstract: The performance of conventional classification algorithms deteriorates under the class imbalance problem, which occurs when one class of data severely outnumbers the other. Data dimensionality also plays a crucial role in this performance deterioration. Principal Component Analysis (PCA) is a widely used technique for dimensionality reduction, but due to its unsupervised nature it does not adequately retain class-discriminative information for classification problems. In unbalanced datasets, minority class samples are rare or costly to obtain, and the misclassification cost associated with them is higher than for non-minority class samples. Capturing and validating labeled samples, particularly minority class samples, in a PCA subspace is therefore an important issue. We propose a class-specific dimensionality reduction and oversampling framework named CPC SMOTE to address it. The framework combines class-specific PCA subspaces to retain informative features from the minority as well as the majority class, and oversamples the combined class-specific PCA subspace to compensate for the lack of data. We evaluated the proposed approach using 1 simulated and 5 UCI repository datasets. The evaluation shows that the framework is effective when compared to PCA and SMOTE preprocessing methods.
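The oversampling step the framework applies in the combined subspace follows the standard SMOTE recipe: interpolate between a minority sample and one of its minority-class nearest neighbours. A minimal sketch (parameter names and the toy data are assumptions, not the paper's):

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority samples by interpolating between a
    random minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself)
        neighbours = sorted((p for p in minority if p != x),
                            key=lambda p: math.dist(p, x))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random point on the segment between x and nb
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1)]
new_samples = smote(minority, n_new=4)
```

In CPC SMOTE this interpolation would run on coordinates in the combined class-specific PCA subspace rather than on raw features.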

Paper Nr: 68
Title:

QUERY PROCESSING FOR ENTERPRISE SEARCH WITH WIKIPEDIA LINK STRUCTURE

Authors:

Nihar Sharma and Vasudeva Varma

Abstract: We present a phrase-based query expansion (QE) technique for enterprise search using a domain-independent concept thesaurus constructed from the Wikipedia link structure. Our approach analyzes article and category link information to derive sets of related concepts for building the thesaurus. In addition, we build a vocabulary set containing natural word order and usage which semantically represents concepts. We extract query-representational concepts from the vocabulary set with a three-layered approach. The concept thesaurus then yields related concepts for expanding a query. Evaluation on TRECENT 2007 data shows an impressive 9 percent increase in recall for fifty queries. In addition, we observed that our implementation improves precision at top-k results by 0.7, 1, 6 and 9 percent for the top 10, 20, 50 and 100 search results respectively, demonstrating the promise that a Wikipedia-based thesaurus holds for domain-specific search.

Paper Nr: 71
Title:

KERNEL OVERLAPPING K-MEANS FOR CLUSTERING IN FEATURE SPACE

Authors:

Chiheb-Eddine Ben N'Cir

Abstract: Producing overlapping schemes is a major issue in clustering. Recent overlapping methods rely on the search for optimal clusters and are based on different metrics, such as Euclidean distance and I-Divergence, used to measure closeness between observations. In this paper, we propose the use of kernel methods to look for separation between clusters in a high-dimensional feature space. To detect non-linearly separable clusters, we propose a Kernel Overlapping k-Means algorithm (KOKM) that uses a kernel-induced distance measure. The number of overlapping clusters is estimated using the Gram matrix. Experiments on different datasets show the correctness of the estimated number of clusters and show that KOKM gives better results than overlapping k-means.

Paper Nr: 73
Title:

A META-LEARNING METHOD FOR CONCEPT DRIFT

Authors:

Runxin Wang and Lei Shi

Abstract: The knowledge hidden in evolving data may change with time; this issue is known as concept drift. It often causes a learning system's prediction accuracy to decrease. Most existing techniques apply ensemble methods to improve learning performance under concept drift. In this paper, we propose a novel meta-learning approach for this issue and develop a method called Multi-Step Learning (MSL). In our method, an MSL learner is structured recursively, containing all the base learners maintained in a hierarchy and ensuring that the learned concepts are traceable. We evaluated MSL and two ensemble techniques on three synthetic datasets containing a number of drastic concept drifts. The experimental results show that the proposed method generally outperforms the ensemble techniques in terms of prediction accuracy.

Paper Nr: 74
Title:

FILTERING ASSOCIATION RULES WITH NEGATIONS ON THE BASIS OF THEIR CONFIDENCE BOOST

Authors:

José L. Balcázar

Abstract: We consider a recent proposal to filter association rules on the basis of their novelty: the confidence boost. We develop appropriate mathematical tools to understand it in the presence of negated attributes, and explore the effect of applying it to association rules with negations. We show that, in many cases, the notion of confidence boost allows us to obtain reasonably sized output consisting of intuitively interesting association rules with negations.

Paper Nr: 78
Title:

AGGREGATION OF IMPLICIT FEEDBACKS FROM SEARCH ENGINE LOG FILES

Authors:

Ashok Veilumuthu and Parthasarathy Ramachandran

Abstract: Current approaches to information retrieval in search engines depend heavily on the web linkage structure, which is a form of relevance judgment by page authors. However, to overcome spamming attempts and language semantics, it is important to also incorporate user feedback on the documents' relevance for a particular query. Since users can hardly be motivated to give explicit/direct feedback on search quality, it becomes necessary to consider implicit feedback that can be collected from search engine logs. Though a number of implicit feedback measures have been proposed to improve search quality, no standard methodology has yet been proposed to aggregate these implicit feedbacks meaningfully into a final ranking of the documents. In this article, we propose an extension to the distance-based ranking model that aggregates different implicit feedbacks based on their expertise in ranking the documents. The proposed approach has been tested on two implicit feedbacks, namely click sequence and time spent reading a document, from actual log data of AlltheWeb.com. The results were convincing and indicate the possibility of expertise-based fusion of implicit feedbacks to arrive at a single ranking of documents for a given query.

Paper Nr: 80
Title:

RECOGNITION OF GENE/PROTEIN NAMES USING CONDITIONAL RANDOM FIELDS

Authors:

David Campos

Abstract: With the overwhelming amount of publicly available data in the biomedical field, traditional tasks performed by expert database annotators rapidly became hard and very expensive. This situation led to the development of computerized systems to extract information in a structured manner. The first step of such systems is the identification of named entities (e.g., gene/protein names), a task called Named Entity Recognition (NER). Much of the current research on this problem is based on Machine Learning (ML) techniques, which demand careful and sensitive configuration of the methods used. This article presents a NER system using Conditional Random Fields (CRFs) as the machine learning technique, combining the best techniques recently described in the literature. The proposed system uses biomedical knowledge and a large set of orthographic and morphological features. An F-measure of 0.7936 was obtained on the BioCreative II Gene Mention corpus, a significantly better performance than similar baseline systems.

Paper Nr: 86
Title:

TIME CONSTRAINTS EXTENSION ON FREQUENT SEQUENTIAL PATTERNS

Authors:

A. Ben Zakour and M. Sistiaga

Abstract: Unlike frequent itemset extraction, for which only a minimum support condition must be met, sequential patterns must also satisfy time constraints. Commonly, for two events to be considered successive, these constraints require either that a minimum and maximum time gap be respected or that the events fall within a window size. In this paper, we introduce a new definition of “interesting sequences”. This property suggests that temporal patterns, introducing the concept of a sliding window, can be customized by the user so that the chronology of events in the extracted sequences need not strictly follow the original event sequence. This definition is incorporated into the process of a conventional algorithm (Fournier-Viger et al., 2008). The extracted patterns have an interval time-stamp form and represent an interesting palette of the original data.

Paper Nr: 93
Title:

A CONTEXT-BASED MODEL FOR WEB QUERY REFORMULATION

Authors:

Ounas Asfari and Bich-Liên Doan

Abstract: Access to relevant information adapted to the needs and the context of the user is a real challenge in Web search, owing to the increase of heterogeneous resources on the web. In most cases, user queries are short and ambiguous, so we need to handle the implicit needs or intentions behind these queries. To improve user query processing, we present a context-based method for query expansion that automatically generates context-related terms. Here, we consider the user context as the current state of the task that the user is undertaking when the information retrieval process takes place; thus State Reformulated Queries (SRQ) are generated according to the user task state and the ontological user profile to provide personalized results in a particular context.

Paper Nr: 94
Title:

RANKING CLASSES OF SEARCH ENGINE RESULTS

Authors:

Zheng Zhu

Abstract: Ranking search results is an ongoing research topic in information retrieval. The traditional models are the vector space, probabilistic and language models, and more recently machine learning has been deployed in an effort to learn how to rank search results. Categorization of search results has also been studied as a means to organize the results and hence improve users' search experience. However, there is little research to date on ranking categories of results, as opposed to ranking the results themselves. In this paper, we propose a probabilistic ranking model that includes categories in addition to a ranked results list, and derive six ranking methods from the model. These ranking methods utilize the following features: the class probability distribution based on query classification, the lowest-ranked document within each class, and the class size. An empirical study comparing these methods with the traditional ranked-list approach in terms of the rank positions of click-through documents shows that there is no simple winner in all cases. Better performance is attained by class size, or by a combination of the class probability distribution of the queries and the rank of the document with the lowest list rank within the class.

Paper Nr: 97
Title:

DOES CAPITALIZATION MATTER IN WEB SEARCH?

Authors:

Silviu Cucerzan

Abstract: We investigate the capitalization features of queries submitted to Web search engines and the relation between capitalization information, either as received from users or as hypothesized based on Web statistics, and search relevance. We observe that users tend to lowercase words in their queries significantly more often than as predicted from Web data. More importantly, we determine that document relevance is strongly correlated with the matching in capitalization between the instances of query tokens in the target document and the tokens of the truecased form of the query as obtained by using Web n-gram data.
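The correlation the abstract describes can be made concrete with a small matching function: given the truecased form of a query, measure how often the document's occurrences of each query token agree with it in capitalization. This is an illustrative sketch of the matching idea only; the truecasing from Web n-gram data is assumed to have already happened, and the example strings are invented.

```python
def capitalization_match(truecased_query, document_text):
    """Fraction of query-token occurrences in the document whose
    capitalization matches the truecased form of the query."""
    doc_tokens = document_text.split()
    matches, total = 0, 0
    for q in truecased_query.split():
        occurrences = [t for t in doc_tokens if t.lower() == q.lower()]
        total += len(occurrences)
        matches += sum(1 for t in occurrences if t == q)
    return matches / total if total else 0.0

score = capitalization_match(
    "New York hotels",
    "Cheap hotels in New York and new york listings")
```

A higher score would indicate better capitalization agreement between the document and the truecased query, which the paper finds to correlate with document relevance.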

Paper Nr: 98
Title:

APPLICATION OF COMBINATORIAL METHODS TO PROTEIN IDENTIFICATION IN PEPTIDE MASS FINGERPRINTING

Authors:

Leonid Molokov

Abstract: Peptide Mass Fingerprinting (PMF) has long been a widely used and reliable method for protein identification. However, it faces several problems, the most important of which is the inability of classical methods to deal with protein mixtures. To cope with this problem, more costly experimental techniques are employed. We investigate whether it is possible to extract more information from PMF by more thorough data analysis. To do this, we propose a novel method to remove noise from the data and show how the results can be interpreted in a different way. We also provide simulation results suggesting our method can be used for the analysis of small mixtures.

Paper Nr: 102
Title:

CLUSTERING OF THREAD POSTS IN ONLINE DISCUSSION FORUMS

Authors:

Dina Said and Nayer Wanas

Abstract: Online discussion forums are a challenging repository for data mining tasks. Forums usually contain hundreds of threads, which in turn may be composed of hundreds, or even thousands, of posts. Clustering these posts can potentially provide better visualization and exploration of online threads. Moreover, clustering can be used for discovering outlier and off-topic posts. In this paper, we propose Leader-based Post Clustering (LPC), a modification of the Leader algorithm for clustering posts in discussion board threads. We also suggest using asymmetric pair-wise distances to measure the dissimilarity between posts. We further investigate the effect of indirect distance between posts, and how to calibrate it with the direct distance. To evaluate the proposed methods, we conduct experiments using artificial and real threads extracted from the Slashdot and Ciao discussion forums. Experimental results demonstrate the effectiveness of the LPC algorithm when using a linear combination of direct and indirect distances, as well as an averaging approach to compute a representative indirect distance.

Paper Nr: 111
Title:

AUTOMATIC IDENTIFICATION OF BIBLICAL QUOTATIONS IN HEBREW-ARAMAIC DOCUMENTS

Authors:

Yaakov Hacohen-Kerner

Abstract: Quotations in a text document contain important information about the content, the context, the sources that the author uses, their importance and impact. Therefore, automatic identification of quotations from documents is an important task. Quotations included in rabbinic literature are difficult to identify and to extract for various reasons. The aim of this research is to automatically identify Biblical quotations included in rabbinic documents written in Hebrew-Aramaic. We deal with various kinds of quotations: partial, missing and incorrect. We formulate nineteen features to identify these quotations. These features were divided into seven different feature sets: matches, best matches, sums of weights, weighted averages, weighted medians, common words, and quotation indicators. Several features are novel. Experiments on various combinations of these features were performed using four common machine learning methods. A combination of 17 features using J48 (an improved version of C4.5) achieves an accuracy of 91.2%, which is an improvement of about 8% compared to a baseline result.

Paper Nr: 113
Title:

TOWARDS LEARNING WITH OBJECTS IN A HIERARCHICAL REPRESENTATION

Authors:

Nicolas Cebron

Abstract: In most supervised learning tasks, objects are perceived as a collection of fixed attribute values. In this work, we extend this notion to a hierarchy of attribute sets with different levels of quality. Given objects in this representation, we might consider learning from most examples at the lowest quality level and enhancing only a few examples for the classification algorithm. We propose an approach for selecting those interesting objects and demonstrate its superior performance compared to random selection.

Paper Nr: 116
Title:

TEXT SIMPLIFICATION USING DEPENDENCY PARSING FOR SPANISH

Authors:

Miguel Ballesteros

Abstract: In this paper we investigate the task of text simplification for Spanish. Our aim is a rule-based system that simplifies text using dependency parsing. Our main motivation is the need for text simplification to facilitate access to information by poor readers and by people with cognitive disabilities. This study constitutes the first step towards building Spanish text simplification systems that help create easy-to-read texts.

Paper Nr: 119
Title:

GIVING SHAPE TO AN N–VERSION DEPENDENCY PARSER - Improving Dependency Parsing Accuracy for Spanish using Maltparser

Authors:

Miguel Ballesteros and Jesús Herrera

Abstract: Maltparser is a contemporary machine-learning-based dependency parsing system that shows great accuracy. However, 90% Labelled Attachment Score (LAS) seems to be a de facto limit for these kinds of parsers. In this paper we present an n-version dependency parser that works as follows: having found that a small set of words is incorrectly parsed more frequently than the rest, the n-version dependency parser consists of n different parsers trained specifically to parse those difficult words. An algorithm sends each word to the appropriate parser, and combined with the action of a general parser this achieves better overall accuracy. This work has been developed specifically for Spanish using Maltparser.

Paper Nr: 121
Title:

LOGIC OF DISCOVERY, DATA MINING AND SEMANTIC WEB - Position Paper

Authors:

Jan Rauch

Abstract: Logic of discovery was developed in the 1970s as an answer to the questions “Can computers formulate and justify scientific hypotheses?” and “Can they comprehend empirical data and process it rationally, using the apparatus of modern mathematical logic and statistics to try to produce a rational image of the observed empirical world?”. Logic of discovery is based on two semantic systems. The observational semantic system corresponds to observational data and statements on observational data. The theoretical semantic system concerns suitable state-dependent structures. The two systems are related via inductive inference rules corresponding to statistical approaches. An attempt was made to adapt logic of discovery to data mining, and a framework was developed that makes it possible to deal with domain knowledge in data mining. The possibility of enhancing this framework to present the results of data mining through the Semantic Web is suggested and discussed.

Paper Nr: 122
Title:

INTEGRATED INSTANCE-BASED AND KERNEL METHODS FOR POWER QUALITY KNOWLEDGE MODELING

Authors:

Mennan Güder

Abstract: In this paper, an integrated knowledge discovery strategy for high-dimensional spatial power quality event data is proposed. Real-time, distributed measurement of electricity transmission system parameters provides a huge number of time-series power quality events. The proposed method aims to construct characteristic event distribution and interaction models for individual power quality sensors and for the whole electricity transmission system, while considering feasibility, time and accuracy concerns. To construct the knowledge and prediction model for the power quality domain, feature construction, feature selection, event clustering, and multi-class support vector machine supervised learning algorithms are employed.

Paper Nr: 123
Title:

KNOWLEDGE-BASED MINING OF PATTERNS AND STRUCTURE OF SYMBOLIC MUSIC FILES

Authors:

Frank Seifert

Abstract: To date, there are no systems that can identify symbolic music in a generic way. That is, it should be possible to associate the countless potential occurrences of a certain song with at least one generic description. The contribution of this paper is twofold: First, we sketch a generic model for music representation. Second, we develop a framework that correlates free symbolic piano performances with such a knowledge base. Based on detected pattern instances, the framework generates hypotheses for higher-level structures and evaluates them continuously. Thus, one or more hypotheses about the identity of such a music performance should be delivered and serve as a starting point for further processing stages. Finally, the framework is tested on a database of symbolic piano music.

Paper Nr: 127
Title:

THE TYPHOON TRACK CLASSIFICATION USING TRI-PLOTS AND MARKOV CHAIN

Authors:

John Chien-Han Tseng

Abstract: We aim at understanding typhoon tracks by classifying them into the ENSO and La Niña types. Two methods, namely tri-plots and a Markov chain combined with a novel dissimilarity measure for trajectory data, are proposed in this work. Tri-plots can separate ENSO from La Niña year typhoon tracks with a training error of about 0.023 to 0.268 and a test error of about 0.271 to 0.334. The Markov chain based dissimilarity measure, combined with the SSVM classifier, can classify tracks with a training error of around 0.031 to 0.173 and a test error of around 0.181 to 0.287. Moreover, for visualization purposes, the tri-plots or Markov chain based method maps the typhoon track data into a low-dimensional space, in which typhoon tracks with small dissimilarity can be regarded as one group. The map can be very helpful for capturing the hidden patterns of ENSO and La Niña atmospheric circulation and for establishing typhoon databases. In general, we believe that tri-plots and Markov chain based methods are useful tools for the typhoon track classification problem and merit further investigation by the related research community.
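The Markov chain side of such an approach can be sketched briefly: discretize each track into a sequence of grid cells, estimate a smoothed transition matrix per class, and compare a track to a class by its negative log-likelihood under that class's chain. The grid size, the smoothing, and the exact dissimilarity definition used in the paper are not given in the abstract, so the choices below are assumptions.

```python
import math
from collections import defaultdict

def fit_chain(cell_sequences, n_cells, alpha=1.0):
    """Estimate Markov transition probabilities over grid cells,
    with Laplace smoothing so unseen transitions keep nonzero mass."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in cell_sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1.0
    P = {}
    for a in range(n_cells):
        total = sum(counts[a].values()) + alpha * n_cells
        P[a] = {b: (counts[a][b] + alpha) / total for b in range(n_cells)}
    return P

def neg_log_likelihood(seq, P):
    """Dissimilarity of a track to a class: NLL under the class chain."""
    return -sum(math.log(P[a][b]) for a, b in zip(seq, seq[1:]))

# Toy cell sequences standing in for discretized typhoon tracks
chain_a = fit_chain([[0, 1, 2, 1, 2]], n_cells=3)
chain_b = fit_chain([[2, 2, 0, 0, 1]], n_cells=3)
track = [0, 1, 2]
nll_own = neg_log_likelihood(track, chain_a)
nll_other = neg_log_likelihood(track, chain_b)
```

A track should score a lower NLL under the chain fitted to tracks of its own type, which is what a classifier such as SSVM can then exploit.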

Paper Nr: 128
Title:

PREDICTING GROUND-BASED AEROSOL OPTICAL DEPTH WITH SATELLITE IMAGES VIA GAUSSIAN PROCESSES

Authors:

Goo Jun and Joydeep Ghosh

Abstract: A Gaussian process regression technique is proposed to predict ground-based aerosol optical depth measurements from satellite multispectral images, and to select the most informative ground-based sites by active learning. Satellite images provide spatial and temporal information in addition to spectral features, and this heterogeneity of available features is captured in the Gaussian process model by employing an additive set of covariance functions. By finding an optimal set of hyperparameters, the relevance of each additional source of information is automatically determined. Experiments show that the spatio-temporal information contributes significantly to the regression results. The predictions are not only more accurate but also more interpretable than those of existing approaches. For active learning, each spatio-temporal setup is evaluated by an uncertainty-sampling algorithm. The results show that the active selection process benefits most from the spatial information.

Posters
Paper Nr: 10
Title:

RECOMMENDATION SYSTEM IN AN AUDIOVISUAL DELIVERY PLATFORM

Authors:

Jose Mª Quinteiro-González, Ernestina Martel-Jordán, Pablo Hernández-Morera and Aaron López-Rodríguez

Abstract: One of the main tasks of information services is to help users find information that satisfies their preferences while reducing their search effort. Recommendation systems filter information and show only the most preferred items. Ontologies are fundamental elements of the Semantic Web and have been exploited to build more accurate and personalized recommendations by inferring missing user preferences. With catalogs changing continuously, ontologies must be built autonomously, without expert intervention. In this paper we present an audiovisual recommendation engine which uses an enhanced ontology filtering technique to recommend audiovisual content. Experimental results show that the improvements to the ontology filtering technique generate accurate recommendations.

Paper Nr: 11
Title:

ONTOLOGY-DRIVEN CONCEPTUAL DOCUMENT CLASSIFICATION

Authors:

Gordana Pavlović-Lažetić and Jelena Graovac

Abstract: Document classification based on the lexical-semantic network wordnet is presented. Two types of document classification in Serbian have been experimented with: classification based on chosen concepts from the Serbian WordNet (SWN) and proper-names-based classification. Conceptual document classification criteria are constructed from hierarchies rooted in a set of chosen concepts (first case) or in some of the proper names' hypernyms (second case). A classifier of the first type is trained and then tested on an indexed and already classified Ebart corpus of Serbian newspapers (476,917 articles). Precision, recall and F-measure show that this type of classification is promising, although incomplete, mainly due to the incompleteness of SWN. For proper-names-based classification, a proper names ontology based on the SWN is presented, and a distance-based similarity measure is defined using Euclidean and Manhattan distances. Classification of a subset of the Contemporary Serbian Language Corpus is presented.

Paper Nr: 17
Title:

CONTEXT VECTOR CLASSIFICATION - Term Classification with Context Evaluation

Authors:

Hendrik Schöneberg

Abstract: Automated deep tagging relies heavily on a term's proper recognition. If its syntax is obfuscated by spelling mistakes, OCR errors or typing variants, regular string matching or pattern matching algorithms may not succeed in the classification. Context Vector Tagging is an approach which analyzes term co-occurrence data and represents it in a vector space model, paying specific attention to the source's language. Using the cosine angle between two context vectors as a similarity measure, we propose that terms with similar context vectors share a similar word class, thus allowing even unknown terms to be classified. This approach is especially suitable for tackling the above-mentioned syntactic problems and can support classic string- or pattern-based classifier algorithms in syntactically challenging environments.
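The core mechanism can be sketched in a few lines: build a co-occurrence vector for each term from a window around its occurrences, then compare terms by the cosine of their vectors. This is a minimal sketch under assumed choices (window size, tokenization); the example sentences are invented.

```python
import math
from collections import Counter

def context_vector(term, sentences, window=2):
    """Co-occurrence counts of words appearing within `window` tokens of `term`."""
    vec = Counter()
    for sent in sentences:
        tokens = sent.lower().split()
        for i, t in enumerate(tokens):
            if t == term:
                for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                    if j != i:
                        vec[tokens[j]] += 1
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

sentences = ["the doctor treated the patient",
             "the physician treated the patient"]
sim = cosine(context_vector("doctor", sentences),
             context_vector("physician", sentences))
```

Here two different surface forms receive identical context vectors, illustrating how a misspelled or unknown variant could still be assigned the right word class by its context.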

Paper Nr: 24
Title:

SEAR - Scalable, Efficient, Accurate, Robust kNN-based Regression

Authors:

Aditya Desai

Abstract: Regression algorithms are used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships. Statistical approaches, although popular, are not generic in that they require the user to make an intelligent guess about the form of the regression equation. In this paper we present a new regression algorithm, SEAR: Scalable, Efficient, Accurate, Robust kNN-based Regression. In addition, SEAR is simple and outlier-resilient. These desirable features make SEAR a very attractive alternative to existing approaches. Our experimental study compares SEAR with fourteen other algorithms on five standard real datasets, and shows that SEAR is more accurate than all its competitors.
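For readers unfamiliar with the family SEAR belongs to, a plain kNN regressor is sketched below; it predicts the mean target of the k nearest training points and needs no guess about the form of the regression equation. SEAR's own weighting and outlier handling are not specified in the abstract, so this baseline is illustrative only.

```python
import math

def knn_regress(train_x, train_y, query, k=3):
    """Predict the mean target value of the k nearest training points."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k

# 1-D toy data: a linear region plus one far-away point
train_x = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)]
train_y = [0.0, 1.0, 2.0, 3.0, 100.0]
pred = knn_regress(train_x, train_y, (1.5,), k=3)
```

Note that the distant point (10.0, 100.0) does not influence the prediction at 1.5, hinting at the locality that makes kNN-based regression naturally resilient to isolated outliers.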

Paper Nr: 25
Title:

CLUSTERING DOCUMENTS WITH LARGE OVERLAP OF TERMS INTO DIFFERENT CLUSTERS BASED ON SIMILARITY ROUGH SET MODEL

Authors:

Nguyen Chi Thanh

Abstract: The similarity rough set model for document clustering (SRSM) uses a generalized rough set model based on a similarity relation and term co-occurrence to group the documents in a collection into clusters. The model extends the tolerance rough set model (TRSM) (Ho and Funakoshi, 1997). The SRSM methods have been evaluated and shown to perform better than TRSM. However, in document collections where terms overlap across different document classes, the effect of SRSM is rather small. In this paper we propose a method to improve the performance of SRSM on such document collections.

Paper Nr: 27
Title:

BROADER PERCEPTION FOR LOCAL COMMUNITY IDENTIFICATION

Authors:

F. T. W. (Frank) Koopmans and Th. P. (Theo) van der Weide

Abstract: A local community identification algorithm can identify the network community of a given start node without knowledge of the entire network. Such algorithms only consider nodes within or directly adjacent to the local community. A local algorithm is therefore more effective than an algorithm that partitions the entire network when only a small portion of a large network is of interest, or when it is difficult to obtain information about the network (as with the world wide web). However, local algorithms cannot deliver the same quality as their global counterparts that use the entire network. We propose an improvement to local community identification algorithms that narrows the gap in relevant network knowledge between global and local methods. Benchmarks on synthetic networks show that our approach increases the quality of locally identified communities in general and decreases the dependency on specific source nodes.

Paper Nr: 30
Title:

DICTIONARY EXTENSION FOR IMPROVING AUTOMATED SENTIMENT DETECTION

Authors:

Johannes Liegl and Stefan Gindl

Abstract: This paper investigates approaches to improve the accuracy of automated sentiment detection in textual knowledge repositories. Many high-throughput sentiment detection algorithms rely on sentiment dictionaries containing terms classified as either positive or negative. To obtain accurate and comprehensive sentiment dictionaries, we merge existing resources into a single dictionary and extend this dictionary by means of semisupervised learning algorithms such as Pointwise Mutual Information - Information Retrieval (PMI-IR) and Latent Semantic Analysis (LSA). The resulting extended dictionary is then evaluated on various datasets from different domains, which were annotated on both the document and sentence level.
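The PMI-IR step mentioned above can be illustrated with Turney-style semantic orientation: score a candidate word by its pointwise mutual information with a positive seed minus its PMI with a negative seed. The co-occurrence counts below are invented toy numbers (in PMI-IR they would come from search-engine hit counts), and the seed words are the classic illustrative choices, not necessarily those used by the authors.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information from raw co-occurrence counts."""
    p_xy = count_xy / total
    p_x, p_y = count_x / total, count_y / total
    return math.log(p_xy / (p_x * p_y)) if p_xy > 0 else float("-inf")

def sentiment_orientation(counts, word, pos_seed="excellent", neg_seed="poor"):
    """SO(word) = PMI(word, positive seed) - PMI(word, negative seed)."""
    total = counts["_total_"]
    return (pmi(counts[(word, pos_seed)], counts[word], counts[pos_seed], total)
            - pmi(counts[(word, neg_seed)], counts[word], counts[neg_seed], total))

# Toy co-occurrence counts (assumed numbers, for illustration only)
counts = {"_total_": 10000, "excellent": 100, "poor": 100, "superb": 50,
          ("superb", "excellent"): 20, ("superb", "poor"): 2}
so = sentiment_orientation(counts, "superb")
```

A positive score suggests adding the candidate to the positive side of the dictionary; a negative score, to the negative side.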

Paper Nr: 34
Title:

PCF: PROJECTION-BASED COLLABORATIVE FILTERING

Authors:

Ibrahim Yakut

Abstract: Collaborative filtering (CF) systems are effective solutions to the information overload problem while contributing to web personalization. Different memory-based algorithms operating over the entire data set have been utilized for CF purposes; however, they suffer from scalability, sparsity, and cold start problems. In this study, to overcome such problems, we propose a new approach based on the projection matrix resulting from principal component analysis (PCA). We analyze the proposed scheme computationally and show that it guarantees scalability while avoiding the sparsity and cold start problems. To evaluate the overall performance of the scheme, we perform experiments using two well-known real data sets. The results demonstrate that our scheme provides accurate predictions efficiently. After analyzing the outcomes, we present some suggestions.

Paper Nr: 37
Title:

CONDITIONAL RANDOM FIELDS FOR TERM EXTRACTION

Authors:

Xing Zhang

Abstract: In this paper, we describe how to construct a machine learning framework that utilizes syntactic information in the extraction of biomedical terms. Conditional random fields (CRFs) are used as the basis of this framework. We investigate the appropriate use of syntactic information, including parent nodes, syntactic paths and term ratios, within the machine learning framework. The experimental results show that syntactic paths and term ratios can improve the precision of term extraction, for both old and novel terms. However, the recall rate for novel terms still needs to be increased. This research serves as an example of constructing machine-learning-based term extraction systems that utilize linguistic information.

Paper Nr: 40
Title:

A CONTENT-BASED APPROACH TO RELEVANCE FEEDBACK IN XML-IR FOR CONTENT AND STRUCTURE QUERIES

Authors:

Luis M. de Campos and Juan M. Fernández-Luna

Abstract: The use of structured documents following the XML representation allows us to create content and structure (CAS) queries which are more specific to the user's needs. In this paper we study how to enrich this kind of query with user feedback in order to get results closer to the user's needs. More formally, we consider how to perform Relevance Feedback (RF) for CAS queries in XML Information Retrieval. Our approach maintains the same structural constraints but expands the content of the queries by adding new keywords to the original CAS query. These new terms are selected by considering their presence/absence in the judged units. This RF method is integrated into an XML-based search engine and evaluated with the INEX 2006 and INEX 2007 collections.

Paper Nr: 50
Title:

A NEW VISUAL DATA MINING TOOL FOR GVSIG GIS

Authors:

Romel Vázquez-Rodríguez, Carlos Pérez-Risquet and Inti Y. Gonzalez-Herrera

Abstract: The integration of scientific visualization (ScVis) techniques into geographic information systems (GIS) is an innovative alternative for the visual analysis of scientific data. Providing GIS with such tools improves the analysis and understanding of datasets with very low spatial density and makes it possible to find correlations between variables in time and space. This paper presents a new visual data mining tool for the GIS gvSIG. This tool is implemented as a gvSIG module and contains several ScVis techniques for multiparameter data with a wide range of possibilities for interaction with the data. The developed module is a powerful visual data mining and data visualization tool for obtaining knowledge from multiple datasets in time and space. A real case study with meteorological data from Villa Clara province (Cuba) is presented, in which the implemented visualization techniques were used to analyze the available datasets. Although tested with meteorological data, the developed module is general and can be used in multiple application fields.

Paper Nr: 57
Title:

A COMPREHENSIVE SOLUTION TO PROCEDURAL KNOWLEDGE ACQUISITION USING INFORMATION EXTRACTION

Authors:

Ziqi Zhang

Abstract: Procedural knowledge is the knowledge required to perform certain tasks. It forms an important part of expertise, and is crucial for learning new tasks. This paper summarises existing work on procedural knowledge acquisition, and identifies two major challenges that remain to be solved in this field; namely, automating the acquisition process to tackle the bottleneck in the formalization of procedural knowledge, and enabling machine understanding and manipulation of procedural knowledge. It is believed that recent advances in information extraction techniques can be applied to compose a comprehensive solution to address these challenges. We identify specific tasks required to achieve the goal, and present detailed analyses of new research challenges and opportunities. It is expected that these analyses will interest researchers of various knowledge management tasks, particularly knowledge acquisition and capture.

Paper Nr: 59
Title:

EXTRACTING MAIN CONTENT-BLOCKS FROM BLOG POSTS

Authors:

Saiful Akbar

Abstract: A blog post typically contains defined blocks carrying different information such as the main content, a blogger profile, links to blog archives, comments, and even advertisements. Thus, identifying and extracting the main content block of blog posts, or of web pages in general, is important for information extraction purposes before further processing. This paper describes our approach for extracting the main content block from blog posts with disparate types of blog mark-up. Adapting the Content Structure Tree (CST)-based approach, we introduce a new way of calculating the importance of HTML content nodes and of defining the attenuation quotient suffered by HTML item/block nodes. Performance is increased because posts published in the same domain tend to share a similar page template, so a general main-content marker can be applied to them. The approach consists of two steps. In the first step, it employs the modified CST approach for detecting the primary and secondary markers for a page cluster. In the next step, it uses HTMLFilter to extract the main block of a page based on the detected markers. When HTMLFilter cannot find the main block, the modified CST is used as the second alternative. Experiments showed that the approach can extract the main block with an accuracy of more than 94%.

Paper Nr: 60
Title:

AUTOLSA: AUTOMATIC DIMENSION REDUCTION OF LSA FOR SINGLE-DOCUMENT SUMMARIZATION

Authors:

Haidi Badr

Abstract: The role of text summarization algorithms is increasing in many applications, especially in the domain of information retrieval. In this work, we propose a generic single-document summarizer based on Latent Semantic Analysis (LSA). In LSA, the dimension reduction ratio is usually determined experimentally, and the result is data- and document-dependent. In this work, we propose a new approach to determine the dimension reduction ratio, DRr, automatically, to overcome the problems of manual determination. The proposed approach is tested using two benchmark datasets, namely DUC02 and LDC2008T19. The experimental results illustrate that the automatically obtained dimension reduction ratio improves the quality of the text summarization while providing a near-optimal value for the DRr.
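An LSA summarizer with an automatic rank choice can be sketched as follows. The energy-threshold criterion used here for picking the reduced dimension is one plausible automatic rule, an assumption for illustration rather than the paper's DRr formula.

```python
import numpy as np

def auto_rank(singular_values, energy=0.9):
    """One plausible automatic criterion: the smallest k whose singular
    values capture `energy` of the total spectral energy."""
    total = np.sum(singular_values ** 2)
    cum = np.cumsum(singular_values ** 2) / total
    return int(np.searchsorted(cum, energy) + 1)

def lsa_sentence_scores(term_sentence, energy=0.9):
    """Score sentences (columns) by their weight in the reduced LSA space;
    the highest-scoring sentences would form the summary."""
    u, s, vt = np.linalg.svd(term_sentence, full_matrices=False)
    k = auto_rank(s, energy)
    # Salience of each sentence: length of its projection in the top-k space.
    scores = np.sqrt(((s[:k, None] * vt[:k]) ** 2).sum(axis=0))
    return scores, k
```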

Paper Nr: 65
Title:

INITIAL EXPERIMENTS WITH EXTRACTION OF STOPWORDS IN HEBREW

Authors:

Yaakov HaCohen-Kerner and Shmuel Yishai Blitz

Abstract: Stopwords are regarded as meaningless in terms of information retrieval. Various stopword lists have been constructed for English and a few other languages. However, to the best of our knowledge, no stopword list has been constructed for Hebrew. In this ongoing work, we present an implementation of three baseline methods that attempt to extract stopwords from a data set containing Israeli daily news. Two of the methods are state-of-the-art methods previously applied to other languages, and the third method is proposed by the authors. Comparison of the behavior of these three methods with Zipf's law shows that Zipf's law succeeds in describing the distribution of the top occurring words according to these methods.
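A minimal frequency-based baseline of the kind such experiments start from, together with the Zipf comparison, can be sketched as follows; this is a generic illustration, not the authors' specific methods.

```python
from collections import Counter

def frequency_stopwords(docs, top_n=10):
    """Baseline stopword extraction: the top-n most frequent tokens
    across the corpus."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    return [w for w, _ in counts.most_common(top_n)]

def zipf_predicted(counts_sorted):
    """Frequencies predicted by Zipf's law f(r) ~ C / r, taking C as the
    frequency of the top-ranked word; comparing these to the observed
    counts shows how well the law describes the top words."""
    c = counts_sorted[0]
    return [c / (r + 1) for r in range(len(counts_sorted))]
```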

Paper Nr: 66
Title:

INTERLEAVING FORWARD BACKWARD FEATURE SELECTION

Authors:

Michael Siebers and Ute Schmid

Abstract: Selecting appropriate features has become a key task when dealing with high-dimensional data. We present a new algorithm designed to find an optimal solution for classification tasks. Our approach combines forward selection, backward elimination and exhaustive search. We demonstrate its capabilities and limits using artificial and real world data sets. On the artificial data sets, interleaving forward backward selection performs similarly to other well-known feature selection methods.
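A greedy wrapper that interleaves forward and backward passes can be sketched as below. The exact interleaving schedule and the exhaustive-search component of the paper's algorithm are simplified away here; `score` is any subset-quality estimate (e.g., cross-validated accuracy).

```python
def interleaved_selection(features, score):
    """Alternate forward addition and backward elimination until neither
    pass improves the score. `score(subset)` returns a quality estimate
    (higher is better)."""
    selected = set()
    best = score(frozenset())
    improved = True
    while improved:
        improved = False
        # Forward pass: add any feature that strictly improves the score.
        for f in features - selected:
            s = score(frozenset(selected | {f}))
            if s > best:
                best, selected, improved = s, selected | {f}, True
        # Backward pass: drop any feature whose removal does not hurt.
        for f in set(selected):
            s = score(frozenset(selected - {f}))
            if s >= best:
                best, selected, improved = s, selected - {f}, True
    return selected, best
```

The backward pass lets the search undo a greedy forward choice that later additions made redundant, which is the motivation for interleaving the two directions.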

Paper Nr: 76
Title:

TWO-PHASE CATEGORIZATION OF WEB DOCUMENTS

Authors:

Vladimir Bartik and Radek Burget

Abstract: The number of pages on the World Wide Web is constantly growing, and there is a need to process pages efficiently and obtain useful knowledge from them. Web page categorization is a very important issue in this area. The method proposed here takes both visual and textual information into consideration. It consists of two phases. In the first phase, web page areas obtained by segmentation are classified based on their visual properties; in the second phase, pages are classified based on information from the first phase and on textual information. Several experiments with web pages taken from news web sites are presented in the final part of the paper.

Paper Nr: 81
Title:

A UNIFYING VIEW OF CONTEXTUAL ADVERTISING AND RECOMMENDER SYSTEMS

Authors:

Giuliano Armano and Eloisa Vargiu

Abstract: From a general perspective, nothing prevents us from viewing contextual advertising as a kind of Web recommendation, aimed at embedding into a Web page the most relevant textual ads available for it. In fact, the task of suggesting an advertisement is a particular case of recommending an item (the advertisement) to a user (the web page), and vice versa. We envision that bringing ideas from contextual advertising could help in building novel recommender systems with improved performance, and vice versa. To this end, in this paper, we propose a unifying view of contextual advertising and recommender systems. In particular, we suggest: (i) a way to build a recommender system inspired by a generic solution typically adopted to solve contextual advertising tasks and (ii) a way to realize a collaborative contextual advertising system in the manner of collaborative filtering.

Paper Nr: 83
Title:

AN ARCHITECTURE FOR COLLABORATIVE DATA MINING

Authors:

Francisco Correia

Abstract: Collaborative Data Mining (CDM) develops techniques to solve complex data analysis problems requiring sets of experts in different domains who may be geographically dispersed. An important issue in CDM is the sharing of experience among the different experts. In this paper we report on a framework that enables users with different expertise to perform data analysis activities and to profit, in a collaborative fashion, from the expertise and results of other researchers. The collaborative process is supported by web services that search for relevant knowledge available among the collaborating web sites. We have successfully designed and deployed a prototype for collaborative Data Mining in the domains of Molecular Biology and Chemoinformatics.

Paper Nr: 85
Title:

LEAD DISCOVERY IN THE WEB

Authors:

Iaakov Exman and Michal Pinto

Abstract: The Web is a huge and very promising source of medical drug leads. However, conventional search with generic search engines does not really yield novel discoveries. Inspired by the drug discovery approach, we add the idea of a "lead" to the search process. The resulting Lead-&-Search protocol avoids the trap of repeated fruitless search and is domain independent. The approach is applied in practice to drug leads in the Web. To serve as input to generic search engines, multi-dimensional chemical structures are linearized into strings, which are sliced into keyword components. Search results are reordered by novelty criteria relevant to drug discovery. Case studies demonstrate the approach: linearized components produce specific search outcomes, with low risk of semantic ambiguity, facilitating reordering and filtering of results.

Paper Nr: 87
Title:

SPATIO-TEMPORAL BLOCK MODEL FOR VIDEO INDEXATION ASSISTANCE

Authors:

Alain Simac-Lejeune

Abstract: In the video indexing framework, we have developed an assistance system that helps the user define a new concept as a semantic index according to the features automatically extracted from the video. Because manual indexing is a long and tedious task, we propose to focus the user's attention on pre-selected prototypes that a priori correspond to the concept. The proposed system is decomposed into three steps. In the first one, basic spatio-temporal blocks are extracted from the video, each particular block being associated with a particular property of one feature. In the second step, a Question/Answer system allows the user to define links between basic blocks in order to define concept block models. Finally, concept blocks are extracted and proposed as prototypes of the concepts. In this paper, we present the first two steps, particularly the block structure, illustrated by an example of video indexing that corresponds to the concept of running in athletics videos.

Paper Nr: 88
Title:

SEMANTIC IDENTIFICATION AND VISUALIZATION OF SIGNIFICANT WORDS WITHIN DOCUMENTS - Approach to Visualize Relevant Words within Documents to a Search Query by Word Similarity Computation

Authors:

Karolis Kleiza

Abstract: This paper first gives an introduction to similarity computation and text summarization of documents using a probabilistic topic model, in particular Latent Dirichlet Allocation (LDA). It then discusses the end-user's need for transparency: understanding why documents with a computed similarity are actually similar to a given search query. To address this, the authors propose an approach to identify and highlight words according to their semantic relevance directly within documents, and provide a theoretical background as well as an adequate visual representation for that approach.

Paper Nr: 90
Title:

INTEGRATED CANDIDATE GENERATION IN PROCESSING BATCHES OF FREQUENT ITEMSET QUERIES USING APRIORI

Authors:

Piotr Jedrzejczak and Marek Wojciechowski

Abstract: Frequent itemset mining can be regarded as advanced database querying where a user specifies constraints on the source dataset and patterns to be discovered. Since such frequent itemset queries can be submitted to the data mining system in batches, a natural question arises whether a batch of queries can be processed more efficiently than by executing each query individually. So far, two methods of processing batches of frequent itemset queries have been proposed for the Apriori algorithm: Common Counting, which integrates only the database scans required to process the queries, and Common Candidate Tree, which extends the concept by allowing the queries to also share their main memory structures. In this paper we propose a new method called Common Candidates, which further integrates processing of the queries from a batch by performing integrated candidate generation.
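The classic Apriori candidate-generation step the method builds on can be sketched as follows. The `common_candidates` helper illustrates our reading of the batch idea, generating candidates once from the queries' combined frequent sets, and is not the paper's exact algorithm.

```python
from itertools import combinations

def apriori_gen(frequent_k):
    """Classic Apriori candidate generation: join frequent k-itemsets
    (sorted tuples) sharing a (k-1)-prefix, then prune candidates that
    have an infrequent k-subset."""
    if not frequent_k:
        return set()
    frequent = set(frequent_k)
    k = len(next(iter(frequent)))
    candidates = set()
    for a in frequent:
        for b in frequent:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                cand = a + (b[-1],)
                # Prune step: every k-subset must itself be frequent.
                if all(sub in frequent for sub in combinations(cand, k)):
                    candidates.add(cand)
    return candidates

def common_candidates(per_query_frequent):
    """Sketch of integrated candidate generation for a batch: run the
    join once over the union of the queries' frequent itemsets, so
    overlapping queries share a single candidate-generation pass."""
    union = set().union(*per_query_frequent)
    return apriori_gen(union)
```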

Paper Nr: 91
Title:

SELECTIVELY LEARNING CLUSTERS IN MULTI-EAC

Authors:

André Lourenço and Ana Fred

Abstract: The Multiple-Criteria Evidence Accumulation Clustering (Multi-EAC) method is a clustering ensemble approach with an integrated cluster stability criterion used to selectively learn the similarity from a collection of different clustering algorithms. In this work we analyze the original Multi-EAC criterion in the context of the classical relative validation criteria, and propose alternative cluster validation indices for the selection of clusters based on pairwise similarities. Taking several clustering ensemble construction strategies as context, we compare the adequacy of each criterion and provide guidelines for its application. Experimental results on benchmark data sets illustrate the proposed concepts.

Paper Nr: 96
Title:

SEMANTIC MEASURES BASED ON WORDNET USING MULTIPLE INFORMATION SOURCES

Authors:

Mamoun Abu Helou and Adnan Abid

Abstract: Recognizing semantic similarity between words is a generic problem for many applications of computational linguistics and artificial intelligence, such as text retrieval, classification and clustering. In this paper we investigate a new approach for measuring semantic similarity that combines methods of existing approaches which use different information sources in their similarity calculations, namely: the shortest path length between the compared words, their depth in the taxonomy hierarchy, information content, the semantic density of the compared words, and the gloss of the words. We evaluate our measure against a benchmark set of human similarity ratings, and the results show that our approach yields better semantic measures than the existing approaches.
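Combining a path-based score with an information-content score can be sketched on a toy is-a taxonomy as below. The taxonomy, the corpus counts, and the equal weighting are all hypothetical illustrations; the paper's formula and its WordNet data differ.

```python
import math

# Hypothetical toy is-a taxonomy and corpus counts, for illustration only.
PARENT = {"dog": "canine", "wolf": "canine", "canine": "animal",
          "cat": "feline", "feline": "animal", "animal": None}
FREQ = {"dog": 30, "wolf": 5, "cat": 25, "canine": 35,
        "feline": 25, "animal": 120}

def ancestors(w):
    chain = []
    while w is not None:
        chain.append(w)
        w = PARENT[w]
    return chain

def depth(w):
    return len(ancestors(w))

def lcs(a, b):
    """Least common subsumer: the deepest ancestor shared by both words."""
    shared = set(ancestors(b))
    return next(w for w in ancestors(a) if w in shared)

def path_sim(a, b):
    """Shortest is-a path through the LCS, mapped into (0, 1]."""
    l = lcs(a, b)
    dist = (depth(a) - depth(l)) + (depth(b) - depth(l))
    return 1.0 / (1 + dist)

def ic(w):
    """Information content: -log of the word's corpus probability."""
    return -math.log(FREQ[w] / FREQ["animal"])

def combined_sim(a, b, weights=(0.5, 0.5)):
    """Weighted combination of the path-based and IC-based scores."""
    ic_score = ic(lcs(a, b)) / max(ic(a), ic(b), 1e-9)
    return weights[0] * path_sim(a, b) + weights[1] * ic_score
```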

Paper Nr: 117
Title:

DETECTING PARALLEL BROWSING TO IMPROVE WEB PREDICTIVE MODELING

Authors:

Geoffray Bonnin

Abstract: Present-day web browsers possess several features that facilitate browsing tasks. Among these features, one of the most useful is the possibility of using tabs. Nowadays, it is very common for web users to use several tabs and to switch from one to another while navigating. Taking parallel browsing into account is thus becoming very important in the frame of web usage mining. Although many studies about web users' navigational behavior have been conducted, few of them deal with parallel browsing. This paper is dedicated to such a study. Taking parallel browsing into account requires information about when tab switches are performed in user sessions. However, navigation logs usually do not contain such information, and parallel sessions appear in a mixed fashion. Therefore, we propose to obtain this information in an implicit way. We thus propose the TABAKO model, which is able to detect tab switches in raw navigation logs and to benefit from this knowledge in order to improve the quality of web recommendations.

Paper Nr: 118
Title:

COMPARISON OF NEURAL NETWORKS USED FOR PROCESSING AND CATEGORIZATION OF CZECH WRITTEN DOCUMENTS

Authors:

Pavel Mautner and Roman Mouček

Abstract: The Kohonen Self-organizing Feature Map (SOM) has been developed for the clustering of input vectors and for the projection of a continuous high-dimensional signal to a discrete low-dimensional space. Another application area where the map can be used is the processing of collections of text documents. The basic principles of the WEBSOM method, the transformation of text information into a real-valued feature vector, and the results of document classification are described in this article. The Carpenter-Grossberg ART-2 neural network, usually used for adaptive vector clustering, was also tested as a document categorization tool. The results achieved with this network are also presented here.

Paper Nr: 125
Title:

CALCULATING SEMANTIC SIMILARITY BETWEEN FACTS

Authors:

Sergey Afonin and Denis Golomazov

Abstract: The present paper is devoted to the calculation of semantic similarity between facts or events. A fact is considered as a single natural-language sentence comprising three parts: "what happened", "where" and "when". Possible types of mismatches between facts are discussed and a function calculating the semantic similarity is proposed. Preliminary experimental results are presented.

Paper Nr: 126
Title:

INTELLIGENT APPROACH TO TRAIN WAVELET NETWORKS FOR RECOGNITION SYSTEM OF ARABIC WORDS

Authors:

Ridha Ejbali

Abstract: In this work, we carried out a research on speech recognition system particularly recognition system of Arabic words based on wavelet network. Our approach of speech recognition is divided into three parts: parameterization phase, training phase and recognition phase. This paper aims at introducing an intelligent algorithm of training of wavelet network for words recognition system. It presents also experimental results and a comparison between old training algorithm based on randomly training of wavelet network and our new approach based on intelligent algorithm of training of wavelet network for recognition system of Arabic words.