KDIR 2009 Abstracts


Full Papers
Paper Nr: 12
Title:

A SIMPLE MEASURE OF THE KOLMOGOROV COMPLEXITY

Authors:

Evgeny Ivanko

Abstract: In this article we propose a simple method to estimate the Kolmogorov complexity of a finite word written over a finite alphabet. Usually it is estimated by the ratio of the length of a word’s compressed archive to the original length of the word. This approach is not satisfactory for information theory because it does not give an abstract measure. Moreover, the Kolmogorov complexity approach is not satisfactory for the practical task of compressibility estimation because it measures the potential compressibility by means of the compression itself. There is another measure of a word’s complexity, subword complexity, which is equal to the number of different subwords in the word. We show the computational difficulties connected with the use of subword complexity and propose a new simple measure of a word’s complexity, which is a practically convenient development of the notion of subword complexity.
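
The two estimates the abstract contrasts can be sketched in a few lines of Python (an illustrative sketch, not the authors' proposed measure; the choice of compressor and normalization are assumptions):

```python
import zlib

def compression_ratio(word: str) -> float:
    """Estimate normalized complexity by the ratio of a word's compressed
    length to its original length: the usual archive-based estimate the
    abstract criticizes. Uses zlib at maximum compression level."""
    data = word.encode("utf-8")
    return len(zlib.compress(data, 9)) / len(data)

def subword_complexity(word: str) -> int:
    """Count the distinct non-empty subwords (factors) of a word:
    the alternative complexity measure mentioned in the abstract."""
    return len({word[i:j] for i in range(len(word))
                for j in range(i + 1, len(word) + 1)})
```

A highly repetitive word such as `"a" * 1000` yields a very small compression ratio, while its subword complexity grows only linearly; enumerating all subwords is quadratic in the word length, which hints at the computational difficulties the paper discusses.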

Paper Nr: 34
Title:

DISCOVERING RELATIONSHIP ASSOCIATIONS IN LIFE SCIENCES USING ONTOLOGY AND INFERENCE

Authors:

Weisen Guo and Steven B. Kraines

Abstract: Over one million papers are published annually in the life sciences. The bioinformatics and knowledge discovery fields aim to help researchers conduct scientific discovery using the existing published knowledge. Existing literature-based discovery methods and tools mainly use text-mining techniques to extract non-specified relationships between two concepts. We present an approach that uses semantic web techniques to measure the relevance between two relationships with specified types that involve a particular entity. We consider two highly relevant relationships to be a relationship association. Relationship associations could help researchers generate scientific hypotheses or create computer-interpretable semantic descriptors for their papers. The relationship association extraction process is described, and the results of experiments on extracting relationship associations from 392 semantic graphs representing MEDLINE papers are presented.

Paper Nr: 52
Title:

TRIGRAMS’N’TAGS FOR LEXICAL KNOWLEDGE ACQUISITION

Authors:

Berenike Litz, Hagen Langer and Rainer Malaka

Abstract: In this paper we propose a novel approach that combines syntactic and context information to identify lexical semantic relationships. We compiled semi-automatically and manually created training data and a test set for evaluation from the first sentences of the German version of Wikipedia. We trained the Trigrams’n’Tags tagger by Brants (Brants, 2000) with a semantically enhanced tagset. The experiments showed that the cleanliness of the data is far more important than its amount. Furthermore, it was shown that bootstrapping is a viable approach to improve the results. Our approach outperformed the competitive lexico-syntactic patterns by 7%, leading to an F1-measure of .91.

Paper Nr: 60
Title:

DOCUMENT RETRIEVAL USING A PROBABILISTIC KNOWLEDGE MODEL

Authors:

Shuguang Wang, Shyam Visweswaran and Milos Hauskrecht

Abstract: We are interested in enhancing information retrieval methods by incorporating domain knowledge. In this paper, we present a new document retrieval framework that learns a probabilistic knowledge model and exploits this model to improve document retrieval. The knowledge model is represented by a network of associations among concepts defining key domain entities and is extracted from a corpus of documents or from a curated domain knowledge base. This knowledge model is then used to perform concept-related probabilistic inferences using link analysis methods and applied to the task of document retrieval. We evaluate this new framework on two biomedical datasets and show that this novel knowledge-based approach outperforms the state-of-the-art Lemur/Indri document retrieval method.

Paper Nr: 61
Title:

DERIVING MODELS FOR SOFTWARE PROJECT EFFORT ESTIMATION BY MEANS OF GENETIC PROGRAMMING

Authors:

Athanasios Tsakonas and Georgios Dounias

Abstract: This paper presents the application of a computational intelligence methodology to effort estimation for software projects. Specifically, we apply a genetic programming model for symbolic regression, aiming to produce mathematical expressions that (1) are highly accurate and (2) can be used for estimating the development effort by revealing relationships between the project’s features and the required work. We investigate the effectiveness of this methodology in two software engineering domains. The system proved able to generate models in the form of handy mathematical expressions that are more accurate than those found in the literature.

Paper Nr: 75
Title:

AN IMPROVED FREQUENT PATTERN-GROWTH APPROACH TO DISCOVER RARE ASSOCIATION RULES

Authors:

R. Uday Kiran and P. Krishna Reddy

Abstract: In this paper we propose an improved approach to extract rare association rules. Association rules which involve rare items are called rare association rules. Mining rare association rules is difficult with single minimum support (minsup) based approaches like Apriori and FP-growth, as they suffer from the “rare item problem” dilemma: at a high minsup, frequent patterns involving rare items will be missed, and at a low minsup, the number of frequent patterns explodes. To address the “rare item problem”, efforts have been made in the literature to extend the “multiple minimum support” framework to both the Apriori and FP-growth approaches. The approaches proposed by extending the “multiple minimum support” framework to Apriori require multiple scans of the dataset and generate a huge number of candidate patterns. The approach proposed by extending the “multiple minimum support” framework to FP-growth is more efficient than the Apriori-based approaches, but still suffers from performance problems. In this paper, we propose an improved multiple minimum support based FP-growth approach that exploits the notions of “least minimum support” and “infrequent leaf node pruning”. Experimental results on both synthetic and real-world datasets show that the proposed approach improves performance over existing approaches.
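
The “multiple minimum support” framework the abstract builds on can be illustrated with a naive enumeration (an Apriori-style sketch for intuition only; the transactions and MIS values are made up, and this is not the authors' improved FP-growth):

```python
from itertools import combinations

def frequent_patterns(transactions, mis):
    """A pattern is frequent if its support meets the lowest minimum item
    support (MIS) among its items, so patterns containing a rare item are
    judged against that item's lower threshold. Brute-force enumeration,
    suitable only for tiny examples."""
    items = sorted({i for t in transactions for i in t})
    n = len(transactions)
    result = {}
    for size in range(1, len(items) + 1):
        for pattern in combinations(items, size):
            support = sum(set(pattern) <= set(t) for t in transactions) / n
            if support >= min(mis[i] for i in pattern):
                result[pattern] = support
    return result
```

With a low MIS for a rare item like "caviar", the pattern ("bread", "caviar") survives even though its support would fall below a single global minsup tuned for the frequent items.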

Paper Nr: 76
Title:

TEXTURE REPRESENTATION AND RETRIEVAL BASED ON MULTIPLE STRATEGIES

Authors:

Noureddine Abbadeni

Abstract: We propose an approach to content-based texture retrieval based on the fusion of multiple search strategies. Given the complexity of images and users’ needs, there is no model or system that is better than all the others in all cases and situations. Therefore, the basic idea of multiple search strategies is to use several models, several representations, several search strategies, several queries, etc., and then fuse (merge) the results returned by each model, representation, strategy or query into a single list using appropriate fusion models. Doing so, search effectiveness (relevance) should improve without significantly degrading search efficiency. We consider the case of homogeneous textures. Texture is represented by three models/viewpoints. We also consider the special case of invariance and use both multiple representations and multiple queries to address this difficult problem. Benchmarking carried out on two image databases shows that retrieval relevance (effectiveness) is improved appreciably with the fused model.

Paper Nr: 77
Title:

SUMMARIZING SETS OF CATEGORICAL SEQUENCES - Selecting and Visualizing Representative Sequences

Authors:

Alexis Gabadinho, Gilbert Ritschard, Matthias Studer and Nicolas S. Müller

Abstract: This paper is concerned with the summarization of a set of categorical sequence data. More specifically, the problem studied is the determination of the smallest possible number of representative sequences that ensure a given coverage of the whole set, i.e. that together have a given percentage of sequences in their neighborhood. The goal is to yield a representative set that exhibits the key features of the whole sequence data set and permits easy, sound interpretation. We propose a heuristic for determining the representative set that first builds a list of candidates using a representativeness score and then eliminates redundancy. We also propose a visualization tool for rendering the results and quality measures for evaluating them. The proposed tools have been implemented in TraMineR, our R package for mining and visualizing sequence data, and we demonstrate their efficiency on a real-world example from the social sciences. The methods are nonetheless in no way limited to social science data and should prove useful in many other domains.
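
The two-step heuristic described above, scoring candidates for representativeness and then eliminating redundancy until the coverage target is met, can be sketched as follows (the density score and greedy redundancy test are simplifying assumptions, not TraMineR's exact algorithm):

```python
def representative_set(sequences, distance, radius, coverage=0.9):
    """Greedy sketch: rank candidates by neighbourhood density, then add
    them in order, skipping candidates whose neighbourhood is already
    covered, until the requested share of sequences has a representative
    within `radius`."""
    def neighbours(s):
        return {t for t in sequences if distance(s, t) <= radius}
    candidates = sorted(sequences, key=lambda s: len(neighbours(s)),
                        reverse=True)
    covered, reps = set(), []
    for cand in candidates:
        new = neighbours(cand) - covered
        if not new:          # redundant candidate: adds no coverage
            continue
        reps.append(cand)
        covered |= new
        if len(covered) >= coverage * len(sequences):
            break
    return reps
```

For example, with a Hamming distance and radius 1, four sequences forming two tight groups are summarized by one representative per group.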

Paper Nr: 94
Title:

UNSUPERVISED DISCRIMINANT EMBEDDING IN CLUSTER SPACES

Authors:

Eniko Szekely, Eric Bruno and Stephane Marchand-Maillet

Abstract: This paper proposes a new representation space, called the cluster space, for data points that originate from high dimensions. Whereas existing dedicated methods concentrate on revealing manifolds from within the data, we consider here the context of clustered data and derive the dimension reduction process from cluster information. Points are represented in the cluster space by means of their a posteriori probability values estimated using Gaussian Mixture Models. The cluster space obtained is the optimal space for discrimination in terms of Quadratic Discriminant Analysis (QDA). Moreover, it is shown to alleviate the negative impact of the curse of dimensionality on the quality of cluster discrimination and is a useful preprocessing tool for other dimension reduction methods. Various experiments illustrate the effectiveness of the cluster space on both synthetic and real data.

Paper Nr: 96
Title:

MINING FOR RELEVANT TERMS FROM LOG FILES

Authors:

Hassan Saneifar, Stéphane Bonniol, Anne Laurent, Pascal Poncelet and Mathieu Roche

Abstract: The information extracted from the log files of computing systems can be considered one of the important resources of information systems. In the case of Integrated Circuit design, the log files generated by design tools are not exhaustively exploited. The logs of this domain are multi-source, multi-format, and have a heterogeneous and evolving structure. Moreover, they usually do not respect the grammar and structures of natural language even though they are written in English. Given these features of such textual data, applying the classical methods of information extraction is not an easy task, particularly for terminology extraction. We have previously introduced the EXTERLOG approach to extract terminology from such log files. In this paper, we introduce a newly developed version of EXTERLOG guided by the Web. We score the extracted terms with a Web- and context-based measure. We favor the more relevant terms of the domain and improve precision by filtering terms based on their scores. The experiments show that EXTERLOG is a well-adapted terminology extraction approach for log files.

Paper Nr: 100
Title:

CLUSTER ENSEMBLE SELECTION - Using Average Cluster Consistency

Authors:

F. Jorge F. Duarte, João M. M. Duarte, M. Fátima C. Rodrigues and Ana L. N. Fred

Abstract: In order to combine multiple data partitions into a more robust data partition, several approaches to produce the cluster ensemble and various consensus functions have been proposed. This range of possibilities in combining multiple data partitions raises a new problem: which of the existing approaches, for producing the cluster ensembles’ data partitions and for combining these partitions, best fits a given data set. In this paper, we address the cluster ensemble selection problem. We propose a new measure to select the best consensus data partition, among a variety of consensus partitions, based on a notion of average cluster consistency between each data partition that belongs to the cluster ensemble and a given consensus partition. We compared the proposed measure with other measures for cluster ensemble selection, using 9 different data sets, and the experimental results showed that the consensus partitions selected by our approach were usually of better quality than those selected by the other measures used in our experiments.

Short Papers
Paper Nr: 14
Title:

USER STUDY OF THE ASSIGNMENT OF OBJECTIVE AND SUBJECTIVE TYPE TAGS TO IMAGES IN INTERNET - Evaluation for Native and non Native English Language Taggers

Authors:

David Nettleton, Mari-Carmen Marcos and Bartolomé Mesa-Lao

Abstract: Image tagging on the Internet is becoming a crucial aspect of the search activity of many users all over the world, as online content evolves from being mainly text based to being multimedia based (text, images, sound, …). In this paper we present a study carried out for native and non-native English language taggers, with the objective of providing user support depending on the detected language skills and characteristics of the user. In order to do this, we analyze the differences between how users tag objectively (using what we call ‘see’ type tags) and subjectively (using what we call ‘evoke’ type tags). We study the data using bivariate correlation, visual inspection and rule induction. We find that the objective/subjective factors are discriminative for native/non-native users and can be used to create a data model. This information can be utilized to help and support the user during the tagging process.

Paper Nr: 15
Title:

ITEM-USER PREFERENCE MAPPING WITH MIXTURE MODELS - Data Visualization for Item Preference

Authors:

Yu Fujimoto, Hideitsu Hino and Noboru Murata

Abstract: In this paper, we propose a technique for visualizing the statistical relation between users and item preferences based on a mixture model. In our visualization, items are given as points in a low-dimensional preference space, and user-specific preferences are given as lines in the same space. The relationship between items and user preferences is intuitively interpreted via projections from points onto lines. As a primitive implementation, we introduce a mixture of Bradley-Terry models, and visualize the relation between items and user preferences with benchmark data sets.
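
The Bradley-Terry building block of the proposed mixture can be sketched briefly (an illustrative sketch; the strengths and mixture weights below are hypothetical, and the authors' estimation and visualization procedure is not reproduced):

```python
def bradley_terry_prob(strength_i, strength_j):
    """P(item i is preferred to item j) under the Bradley-Terry model:
    each item has a positive strength, and preference probability is the
    strength share."""
    return strength_i / (strength_i + strength_j)

def mixture_prob(strengths_by_component, weights, i, j):
    """Preference probability under a mixture of Bradley-Terry models,
    one strength table per latent user group, combined by mixture weights."""
    return sum(w * bradley_terry_prob(s[i], s[j])
               for w, s in zip(weights, strengths_by_component))
```

Two equally weighted components with opposite strengths cancel out to indifference, which is exactly the kind of user-group structure a single Bradley-Terry model cannot represent.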

Paper Nr: 22
Title:

ENTAILMENT OF CAUSAL QUERIES IN NARRATIVES USING ACTION LANGUAGE

Authors:

Pawan Goyal, Laxmidhar Behera and T. M. Mcginnity

Abstract: In this paper, the Action Language formalism has been used to reason about narratives in a multi-agent framework. The actions have been given a semantic frame representation. Hypothetical situations have been dealt with using different states for world knowledge and agents’ knowledge. A notion of plan recognition has been proposed to answer causal queries. Finally, an algorithm has been proposed for automatically translating a given narrative into the representation, and causal query entailment has been demonstrated.

Paper Nr: 23
Title:

USING ASSOCIATION RULE MINING TO ENRICH SEMANTIC CONCEPTS FOR VIDEO RETRIEVAL

Authors:

Nastaran Fatemi, Florian Poulin, Laura E. Raileanu and Alan F. Smeaton

Abstract: In order to achieve true content-based information retrieval on video, we should analyse and index video with high-level semantic concepts in addition to using user-generated tags and structured metadata like title, date, etc. However, the range of such high-level semantic concepts, detected either manually or automatically, is usually limited compared to the richness of information content in video and the potential vocabulary of available concepts for indexing. Even though there is work to improve the performance of individual concept classifiers, we should strive to make the best use of whatever partial sets of semantic concept occurrences are available to us. We describe in this paper our method for using association rule mining to automatically enrich the representation of video content through a set of semantic concepts based on concept co-occurrence patterns. We describe our experiments on the TRECVid 2005 video corpus annotated with the 449 concepts of the LSCOM ontology. The evaluation of our results shows the usefulness of our approach.

Paper Nr: 36
Title:

FINDING PROTEIN FAMILY SIMILARITIES IN REAL TIME THROUGH MULTIPLE 3D AND 2D REPRESENTATIONS, INDEXING AND EXHAUSTIVE SEARCHING

Authors:

Eric Paquet and Herna Lydia Viktor

Abstract: Research suggests that the complex geometric shapes of amino-acid sequence folds often determine their functions. In order to aid domain experts to classify new protein structures, and to be able to identify the functions of such new discoveries, accurate shape-related algorithms for locating similar protein structures are thus needed. To this end, we present our Content-based Analysis of Protein Structure for Retrieval and Indexing system, which locates protein families, and identifies similarities between families, based on the 2D and 3D signatures of protein structures. Our approach is novel in that we utilize five different representations, using a query by prototype approach. These diverse representations provide us with the ability to view a particular protein structure, and the family it belongs to, focusing on (1) the C-α chain, (2) the atomic position, or (3) the secondary structure, based on (4) residue type or (5) residue name. Our experimental results indicate that our method is able to accurately locate protein families when evaluated against the 53,000 entries in the Protein Data Bank, performing an exhaustive search in a fraction of a second.

Paper Nr: 43
Title:

A UTILITY CENTERED APPROACH FOR EVALUATING AND OPTIMIZING GEO-TAGGING

Authors:

Albert Weichselbraun

Abstract: Geo-tagging is the process of annotating a document with its geographic focus by extracting a unique locality that describes the geographic context of the document as a whole (Amitay et al., 2004). Accurate geographic annotations are crucial for geospatial applications such as Google Maps or the IDIOM Media Watch on Climate Change (Hubmann-Haidvogel et al., 2009), but many obstacles complicate the evaluation of such tags. This paper introduces an approach for optimizing geo-tagging by applying the concept of utility from economic theory to tagging results. Computing utility scores for geo-tags allows a fine-grained evaluation of the tagger’s performance with regard to multiple dimensions specified in use case specific domain ontologies, and provides means for addressing problems such as the different scope and coverage of evaluation corpora. The integration of external data sources and evaluation ontologies with user profiles ensures that the framework considers use case specific requirements. The presented model is instrumental in comparing different geo-tagging settings, evaluating the effect of design decisions, and customizing geo-tagging to particular use cases.

Paper Nr: 53
Title:

DNA AND NATURAL LANGUAGES - Text Mining

Authors:

Gemma Bel-Enguix, Veronica Dahl and M. Dolores Jimenez-lopez

Abstract: We present, discuss and exemplify a fully implemented model of text mining that can be applied to spoken languages as well as to molecular biology languages. It is based on the model presented in (Zahariev et al., 2009), oriented to discovering DNA barcodes for sequences. The novelty of our methodology is the use of Constraint Based Reasoning to detect string repetitions through unification, by introducing a new general rule for matching. We claim that the same method can be successfully applied to mining natural language texts.

Paper Nr: 54
Title:

A KDD APPROACH FOR DESIGNING FILTERING STRATEGIES TO IMPROVE VIRTUAL SCREENING

Authors:

Leo Ghemtio, Malika Smaïl-Tabbone, Marie-Dominique Devignes, Michel Souchet and Bernard Maigret

Abstract: Virtual screening has become an essential step in the early drug discovery process. Generally speaking, it consists in using computational techniques for selecting compounds from chemical libraries in order to identify drug-like molecules acting on a biological target of therapeutic interest. In the present study we consider virtual screening as a particular form of the KDD (Knowledge Discovery from Databases) approach. The knowledge to be discovered concerns the way a compound can be considered as a consistent ligand for a given target. The data from which this knowledge has to be discovered derive from diverse sources such as chemical, structural, and biological data related to ligands and their cognate targets. More precisely, we aim to extract filters from chemical libraries and protein-ligand interactions. In this context, the three basic steps of a KDD process have to be implemented. Firstly, a model-driven data integration step is applied to appropriate heterogeneous data found in public databases. This facilitates subsequent extraction of various datasets for mining. In a second step, mining algorithms are applied to the datasets and finally the most accurate knowledge units are eventually proposed as new filters. We present here this KDD approach and the experimental results we obtained with a set of ligands of the hormone receptor LXR.

Paper Nr: 56
Title:

EXPLORATIVE DATA MINING FOR THE SIZING OF POPULATION GROUPS

Authors:

Isis Peña, Herna Lydia Viktor and Eric Paquet

Abstract: In the apparel industry, an important challenge is to produce garments that fit various populations well. However, repeated studies of customers’ levels of satisfaction indicate that this is often not the case. The following questions come to mind. What, then, are the typical body profiles of a population? Are there significant differences between populations, and if so, which body measurements need special care when e.g. designing garments for Italian females? Within a population, would it be possible to identify the measurements that are of importance for different sizes and genders? Furthermore, assume that we have access to an accurate anthropometric database. Would there be a way to guide the data mining process to discover only those body measurements that are of the most interest for apparel designers? This paper describes our results when addressing these questions. To this end, we explore a database, containing anthropometric measurements and 3-D body scans, of samples of the North American, Italian and Dutch populations. Our results show that we accurately discover the relevant subsets of body measurements, through the use of feature selection and feature extraction based on objective interestingness measures, for the various body sizes within each population and gender.

Paper Nr: 57
Title:

BUILDING EXTENSIBLE 3D INTERACTION METADATA WITH INTERACTION INTERFACE CONCEPT

Authors:

Jacek Chmielewski

Abstract: The proliferation of 3D technology is visible in many aspects of current software development. Elements of 3D can be found in many applications, from computer games to professional engineering software to end-user software. This is especially important in the end-user field, where 3D technology is usually used to provide a natural and intuitive user interface. Fast development of such new applications can be driven by the reuse of interactive 3D objects that are built for one application and reused in many others. Nowadays, many such objects are stored in shared repositories, which poses the challenge of finding the right objects for a new application. To search efficiently for interactive 3D objects, metadata describing an object’s geometry, semantics and interactions is required. Existing metadata standards can deal with only the first two areas. In this paper a new approach to interaction metadata is presented. The proposed Multimedia Interaction Model is accompanied by an Interaction Interface concept that allows creating interaction metadata not limited to a predefined set of description elements. The Interaction Interface concept provides a method and tools for defining new description elements in a way suitable for automatic processing and automatic analysis by search engines.

Paper Nr: 80
Title:

INVARIANT CATEGORISATION OF POLYGONAL OBJECTS USING MULTI-RESOLUTION SIGNATURES

Authors:

Roberto Lam and J. M. Hans du Buf

Abstract: With the increasing use of 3D objects and models, mining of 3D databases is becoming an important issue. However, 3D object recognition is very time consuming because of variations due to position, rotation, size and mesh resolution. A fast categorisation can be used to discard non-similar objects, such that only a few objects need to be compared in full detail. We present a simple method for characterising 3D objects with the goal of performing a fast similarity search in a set of polygonal mesh models. The method constructs, for each object, two sets of multi-scale signatures: (a) the progression of deformation due to iterative mesh smoothing and, similarly, (b) the influence of mesh dilation and erosion using a sphere with increasing radius. The signatures are invariant to 3D translation, rotation and scaling, and also to mesh resolution because of proper normalisation. The method was validated on a set of 31 complex objects, each object being represented at three mesh resolutions. The results were measured in terms of Euclidean distance for ranking all objects, with an overall average ranking rate of 1.29.

Paper Nr: 83
Title:

TEXT CLASSIFICATION THROUGH TIME - Efficient Label Propagation in Time-Based Graphs

Authors:

Shumeet Baluja, Deepak Ravichandran and D. Sivakumar

Abstract: One of the fundamental assumptions of machine-learning based text classification systems is that the underlying distribution from which the set of labeled text is drawn is identical to the distribution from which the text to be labeled is drawn. However, on live news aggregation sites, this assumption is rarely correct. Instead, the events and topics discussed in news stories change dramatically over time. Rather than ignoring this phenomenon, we attempt to explicitly model the transitions of news stories and classifications over time to label stories that may be acquired months after the initial examples are labeled. We test our system, based on efficiently propagating labels in time-based graphs, with recently published news stories collected over an eighty-day period. Experiments presented in this paper range from using training labels from every story within the first several days of gathering stories to using a single story as a label.
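
The core idea of propagating labels through a graph with fixed seed labels can be sketched as follows (a minimal iterative-averaging sketch; the construction of the time-based graph from stories and any damping details are assumptions, not the authors' exact system):

```python
def propagate_labels(edges, seed_labels, iterations=10):
    """Each unlabeled node repeatedly takes the average label score of
    its neighbours, while seed nodes keep their fixed labels. On a graph
    linking stories across time slices, scores diffuse from old labeled
    stories to newly acquired ones."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
    scores = {n: seed_labels.get(n, 0.0) for n in neighbours}
    for _ in range(iterations):
        updated = {}
        for node in scores:
            if node in seed_labels:          # seeds keep their labels
                updated[node] = seed_labels[node]
            else:
                nbrs = neighbours[node]
                updated[node] = sum(scores[m] for m in nbrs) / len(nbrs)
        scores = updated
    return scores
```

On a three-node chain with opposing seeds at the ends, the middle node settles at the average of the two, illustrating how an unlabeled story inherits a blend of its neighbours' classifications.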

Paper Nr: 86
Title:

COMPUTATION OF THE SEMANTIC RELATEDNESS BETWEEN WORDS USING CONCEPT CLOUDS

Authors:

Swarnim Kulkarni and Doina Caragea

Abstract: Determining the semantic relatedness between two words refers to computing a statistical measure of similarity between those words. Word similarity measures are useful in a wide range of applications such as natural language processing, query recommendation, relation extraction, spelling correction, document comparison and other information retrieval tasks. Although several methods that address this problem have been proposed in the past, effective computation of semantic relatedness still remains a challenging task. In this paper, we propose a new technique for computing the relatedness between two words. In our approach, instead of computing the relatedness between the two words directly, we propose to first compute the relatedness between their generated concept clouds using web-based coefficients. Next, we use the obtained measure to determine the relatedness between the original words. Our approach relies heavily on a concept extraction algorithm that extracts concepts related to a given query and generates a concept cloud for the query concept. We perform an evaluation on the Miller-Charles benchmark dataset and obtain a correlation coefficient of 0.882, which is better than the correlation coefficients of all other existing state-of-the-art methods, hence providing evidence for the effectiveness of our method.
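
The idea of comparing concept clouds instead of the words themselves can be sketched with a simple set coefficient (the Jaccard coefficient below is a stand-in assumption; the abstract does not specify which web-based coefficients the authors use):

```python
def cloud_relatedness(cloud_a, cloud_b):
    """Relatedness between two words via the overlap of their concept
    clouds: the share of concepts the clouds have in common (Jaccard
    coefficient), rather than a direct word-to-word comparison."""
    cloud_a, cloud_b = set(cloud_a), set(cloud_b)
    if not cloud_a | cloud_b:
        return 0.0
    return len(cloud_a & cloud_b) / len(cloud_a | cloud_b)
```

Two words with disjoint surface forms can still score high if their extracted concept clouds share many members, which is the indirection the paper exploits.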

Paper Nr: 90
Title:

XHITS - Multiple Roles in a Hyperlinked Structure

Authors:

Francisco Benjamim Filho, Raul Pierre Renteria and Ruy Luiz Milidiú

Abstract: The WWW is a huge and rich environment. Web pages can be viewed as a large community of elements that are connected through links for a variety of reasons. The HITS approach introduces two basic concepts, hubs and authorities, that reveal some hidden semantic information from the links. In this paper, we present XHITS, a generalization of HITS that models multiple-class problems, and a machine learning algorithm to calibrate it. We split classification influence into two sources. The first one is due to link propagation, whereas the second one is due to classification reinforcement. We derive a simple linear iterative equation to compute the classification values. We also provide an influence equation that shows how the two influence sources can be combined. Two special cases are explored: symmetric reinforcement and positive reinforcement. We show that for these two special cases the iterative scheme converges. Some illustrative examples and empirical tests are also provided. They indicate that XHITS is a powerful and efficient modeling approach.

Paper Nr: 99
Title:

INTERESTINGNESS – A UNIFYING PARADIGM - Bipolar Function Composition

Authors:

Iaakov Exman

Abstract: Interestingness is an important criterion by which we judge knowledge discovery. But interestingness has escaped all attempts to capture its intuitive meaning in a concise and comprehensive form. A unifying paradigm is formulated by function composition. We claim that composition is bipolar, i.e. a composition of exactly two functions, whose two semantic poles are relevance and unexpectedness. The paradigm’s generality is demonstrated by case studies of new interestingness functions, examples of known functions that fit the framework, and counter-examples for which the paradigm points out the missing pole.

Paper Nr: 104
Title:

ARTIFICIAL DATA GENERATION FOR ONE-CLASS CLASSIFICATION - A Case Study of Dimensionality Reduction for Text and Biological Data

Authors:

Santiago D. Villalba and Pádraig Cunningham

Abstract: Artificial negatives have been employed in a variety of contexts in machine learning to overcome data availability problems. In this paper we explore the use of artificial negatives for dimension reduction in one-class classification, that is, classification problems where only positive examples are available for training. We present four different strategies for generating artificial negatives and show that two of these strategies are very effective for discovering discriminating projections on the data, i.e., low-dimensional projections for discriminating between positive and real negative examples. The paper concludes with an assessment of the selection bias of this approach to dimension reduction for one-class classification.

Paper Nr: 105
Title:

A PATENT RETRIEVAL METHOD USING SEMANTIC ANNOTATIONS

Authors:

Youngho Kim, Jihee Ryu and Sung-Hyon Myaeng

Abstract: Automatic annotation of key phrases with their semantic categories can help improve the effectiveness of a variety of text-based systems, including information retrieval, summarization, question answering, etc. In this paper, we exploit semantic annotations for patent retrieval (i.e., patent invalidity search). We first annotated key phrases with two semantic categories, PROBLEM (e.g. “pattern matching”) and SOLUTION (e.g. “dynamic programming”), in each patent document, which together characterize a particular technology. Semantic clusters are formed by grouping patent documents with the same PROBLEM or SOLUTION tag. A language modelling approach to information retrieval is extended to consider the semantically oriented clusters as well as document models. Our retrieval evaluation of the proposed approach using a collection of United States patent documents shows a 22% improvement over the baseline, a smoothed language modelling approach that does not use the semantic annotations.

Paper Nr: 113
Title:

AN ARTIFICIAL MOLECULAR MODEL TO FOSTER COMMUNITIES

Authors:

Christoph Schommer

Abstract: This paper introduces, in extracts, a bio-inspired model that treats graphs as artificial chemical constructs. The main objective is to establish this model as an autonomous and adaptive system that performs internal tasks, for example communication with its environment. The model itself focuses on the artificial atomicity of nodes, the artificial molecular connections between them, and functional proteins, which are self-concentrated constructs. The model rests on a solid foundation, but fosters an artificial vitality through catalysts: in case of common “interests” (inside the molecular model), these merge attacked atomic nodes into functional proteins and thereby contribute to a vivid shape of communities. As an application example, the theoretical model is illustrated with bibliographic entries that dynamically form bibliographic communities while a stream of bibliographic entries serves as input.

Paper Nr: 120
Title:

CHANGE OF TOPICS OVER TIME - Tracking Topics by their Change of Meaning

Authors:

Gerhard Heyer, Florian Holz and Sven Teresniak

Abstract: In this paper we present a new approach to the analysis of topics and their dynamics over time. Given a large amount of news text on a daily basis, we have identified “hotly discussed” concepts by examining the contextual shift between the time slices. We adopt the volatility measure from econometrics and propose a new algorithm for frequency-independent detection of topic drift.
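A minimal sketch of a frequency-independent volatility measure (an illustrative simplification, not the authors' exact algorithm) tracks how the ranks of a term's co-occurring context words change between consecutive time slices:

```python
def context_volatility(context_ranks):
    """context_ranks: one dict per time slice mapping each context term
    to its rank among the target word's co-occurrents in that slice.
    Returns the mean absolute rank change of shared context terms
    between consecutive slices -- a simple proxy for how much a word's
    usage context shifts, independent of the word's raw frequency."""
    changes = []
    for prev, cur in zip(context_ranks, context_ranks[1:]):
        shared = set(prev) & set(cur)
        if shared:
            changes.append(sum(abs(prev[t] - cur[t]) for t in shared)
                           / len(shared))
    return sum(changes) / len(changes) if changes else 0.0
```

A word whose top co-occurrents keep their ranks scores 0; a “hotly discussed” word whose context reshuffles daily scores high.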

Paper Nr: 121
Title:

HIERARCHICAL TAXONOMY EXTRACTION BY MINING TOPICAL QUERY SESSIONS

Authors:

Miguel Fernández-Fernández and Daniel Gayo-Avello

Abstract: Search engine logs store detailed information on Web users' interactions. Thus, as more and more people use search engines on a daily basis, important trails of users' common knowledge are being recorded in those files. Previous research has shown that it is possible to extract concept taxonomies from full-text documents, while other scholars have proposed methods to obtain similar queries from query logs. We propose a mixture of both lines of research: mining query logs not for related queries or query hierarchies, but for actual term taxonomies. In this first approach we have investigated the feasibility of finding hyponymy relations between terms or noun phrases by exploiting specialization search patterns in topical sessions, obtaining encouraging preliminary results.

Paper Nr: 124
Title:

TOPIC DETECTION IN BIBLIOGRAPHIC DATABASES

Authors:

Maria Biryukov

Abstract: Detection of research topics in scientific publications has attracted a lot of attention in the past few years. In this paper we introduce and compare various metrics of topic ranking, which make it possible to distinguish between general and focused topic terms. We use DBLP as a testbed for our experiments.

Paper Nr: 126
Title:

UNBOXING DATA MINING VIA DECOMPOSITION IN OPERATORS - Towards Macro Optimization and Distribution

Authors:

Alexander Wöhrer, Yan Zhang, Ehtesam-ul-Haq Dar and Peter Brezany

Abstract: Data mining deals with finding hidden knowledge patterns in often huge data sets. The work presented in this paper elaborates on defining data mining tasks in terms of fine-grained composable operators instead of coarse-grained black-box algorithms. Data mining tasks in the knowledge discovery process typically need one relational table as input, preceded by data preprocessing and integration. The possible combination of different kinds of operators (relational, data mining and data preprocessing operators) represents a novel holistic view of the knowledge discovery process: initially, as described in this paper, for the low-level execution phase, but yielding the potential for rich optimization similar to relational query optimization. We argue that such macro-optimization, embracing the overall KDD process, leads to improved performance, in contrast to micro-optimization that focuses on just a small part of it.

Posters
Paper Nr: 4
Title:

2-CLASS EIGEN TRANSFORMATION CLASSIFICATION TREES

Authors:

Steven De Bruyne and Frank Plastria

Abstract: We propose a classification algorithm that extends linear classifiers for binary classification problems by searching for possible later splits to deal with remote clusters. These additional splits are searched for in directions given by several eigen transformations. The resulting structure is a tree with unique properties that allow, during the construction of the classifier, the use of criteria more directly related to classification power than is the case with traditional classification trees. We show that the algorithm produces classifiers equivalent to linear classifiers where the latter are optimal, and otherwise offers higher flexibility while being more robust than traditional classification trees. We show how the classification algorithm can outperform traditional classification algorithms on a real-life example. The new classifiers retain the level of interpretability of linear classifiers and traditional classification trees that is unavailable with more complex classifiers. Additionally, they make it easy not only to identify the main properties of the separate classes, but also to identify properties of potential subclasses.

Paper Nr: 13
Title:

LANGUAGE TECHNOLOGY FOR INFORMATION SYSTEMS

Authors:

Paul Schmidt, Mahmoud Gindiyeh and Gintare Grigonyte

Abstract: This paper presents work carried out in a project funded by the German Ministry of Economy whose goal is to develop tools for information systems on the basis of high-quality language technology and standard statistical methods. The paper shows how automatic indexing, automatic classification and information retrieval can be combined to efficiently create a high-quality information processing system for expert knowledge in technical engineering.

Paper Nr: 21
Title:

DEVELOPMENT AND APPLICATION OF A ROACH GENE REGULATION PROFILE BASED GENDER DISCRIMINATION METHOD

Authors:

Yongxiang Fang

Abstract: This study extracted, from a roach gonad microarray data set, a gene regulation profile contrasting male and female roaches. A method was then developed to use this profile to discriminate the genders of the roaches involved in another roach microarray experiment, in which the roaches were too young for their genders to be determined. Gender is a non-ignorable factor in roach gene expression studies, and gender information is vital for the success of such a microarray study, because without it the treatment effects cannot be estimated correctly. A comparison of the analytical results for the target data set with and without accounting for gender effects shows that the estimation of treatment effects is greatly improved when the obtained gender information is incorporated in the data analysis. This, in turn, is evidence that the roach gender discrimination method developed in this study performs very well.

Paper Nr: 49
Title:

ARABELLA - A Directed Web Crawler

Authors:

Pedro Lopes, Davide Pinto, David Campos and José Luís Oliveira

Abstract: The Internet is becoming the primary source of knowledge. However, its disorganized evolution brought about an exponential increase in the amount of distributed, heterogeneous information. Web crawling engines were the first answer to ease the task of finding the desired information. Nevertheless, when one is searching for quality information related to a certain scientific domain, typical search engines like Google are not enough. This is the problem that directed crawlers try to solve. Arabella is a directed web crawler that navigates through a predefined set of domains searching for specific information. It includes text-processing capabilities that increase the system’s flexibility and the number of documents that can be crawled: any structured document or REST web service can be processed. These complex processes do not harm overall system performance due to the multithreaded engine that was implemented, resulting in an efficient and scalable web crawler.

Paper Nr: 50
Title:

LINK INTEGRATOR - A Link-based Data Integration Architecture

Authors:

Pedro Lopes, Joel Arrais and José Luís Oliveira

Abstract: The evolution of the World Wide Web has created a great opportunity for data production and for the construction of public repositories that can be accessed all over the world. However, as our ability to generate new data grows, there is a dramatic increase in the need for its efficient integration and access to all the dispersed data. In specific fields such as biology and biomedicine, data integration challenges are even more complex. The amount of raw data, the possible data associations, the diversity of concepts and data formats, and the demand for information quality assurance are just a few issues that hinder the development of a general proposal and solid solutions. In this article we describe a lightweight information integration architecture that is capable of unifying, in a single access point, several heterogeneous bioinformatics data sources. The model is based on web crawling that automatically collects keywords related with biological concepts that are previously defined in a navigation protocol. This crawling phase allows the construction of a link-based integration mechanism that conducts users to the right source of information, keeping the original interfaces of available information and maintaining the credits of original data providers.

Paper Nr: 69
Title:

BEHAVIOR OF DIFFERENT IMAGE CLASSIFIERS WITHIN A BROAD DOMAIN

Authors:

B. Clemente, M. L. Durán, A. Caro and P. G. Rodríguez

Abstract: Image classification is one of the most important research tasks in the Content-Based Image Retrieval area. The term image categorization refers to labeling images under one of a number of predefined categories. Although this task is usually not too difficult for humans, it has proved to be extremely complex for machines (or computer programs). The major issues concern variable and sometimes uncontrolled imaging conditions. This paper focuses on observing the behavior of different classifiers within a collection of general-purpose images (photos). We carry out a contrastive study between the groups obtained from these mathematical classifiers and a prior classification developed by humans.

Paper Nr: 70
Title:

ANONYMITEXT: ANONYMIZATION OF UNSTRUCTURED DOCUMENTS

Authors:

Rebeca Perez-Lainez, Ana Iglesias and Cesar de Pablo-Sanchez

Abstract: The anonymization of unstructured texts is nowadays a task of great importance in several text mining applications. Medical record anonymization is needed both to preserve the privacy of personal health information and to enable further data mining efforts. The ANONYMITEXT system described here is designed to de-identify sensitive data in unstructured documents. It has been applied to Spanish clinical notes to recognize sensitive concepts that would need to be removed if the notes are used beyond their original scope. The system combines several medical knowledge resources with semantic dictionaries induced from the clinical notes. An evaluation of the semi-automatic process has been carried out on a subset of the clinical notes, covering the most frequent attributes.

Paper Nr: 74
Title:

CLASSIFICATION BY SUCCESSIVE NEIGHBORHOOD

Authors:

David Grosser, Henri Ralambondrainy and Noel Conruyt

Abstract: The formalization of scientific knowledge in the life sciences by experts in biology or systematics produces arborescent representations whose values may be present, absent or unknown. To improve the robustness of the classification process for these complex, often partially described objects, we propose a new classification method that is iterative, interactive and semi-directed. It combines inductive techniques for the choice of discriminating variables with a search for nearest neighbors based on various similarity measures that take into account the structures and values of the objects in the neighborhood computation.

Paper Nr: 79
Title:

DISAMBIGUATING WEB SEARCH RESULTS BY TOPIC AND TEMPORAL CLUSTERING - A Proposal

Authors:

Ricardo Campos, Gaël Dias and Alípio Mário Jorge

Abstract: With so much information available on the web, looking for relevant documents on the Internet has become a difficult task. Temporal features play an important role, introducing a time dimension and the possibility to restrict a search by time, recreating a particular moment of a set of web pages. Despite its importance, temporal information is still under-considered by current search engines, which limit themselves to capturing the most recent snapshot of the information. In this paper, we describe the architecture of a temporal search engine that uses timelines to browse search results. More specifically, we intend to add a time measure to cluster web page results by analyzing web page contents, supporting the search for temporal and non-temporal information embedded in web documents.

Paper Nr: 82
Title:

BUSINESS INTELLIGENCE IN HIGHER EDUCATION - Managing the Relationships with Students

Authors:

Maria Beatriz Piedade and Maribel Yasmina Santos

Abstract: Closely monitoring students' academic activities, evaluating their academic success and staying close to their day-to-day academic activities are key factors in promoting students' academic success in higher education institutions. To make the implementation of such monitoring processes and activities possible, it is essential to acquire knowledge about the students and their academic behaviour. This knowledge supports the decision-making associated with the teaching-learning process, enhancing an effective institution-student relationship. This paper presents a Student Relationship Management (SRM) system that is under development. The SRM system supports the SRM concept and practice and has been implemented using concepts and technologies associated with Business Intelligence systems. To demonstrate the SRM system's relevance in acquiring knowledge about the students and in supporting actions and decisions based on such knowledge, an application case carried out in a real context is also presented.

Paper Nr: 84
Title:

SELECTING CATEGORICAL FEATURES IN MODEL-BASED CLUSTERING

Authors:

Cláudia M. V. Silvestre, Margarida M. G. Cardoso and Mario A. T. Figueiredo

Abstract: There has been relatively little research on feature/variable selection in unsupervised clustering. In fact, feature selection for clustering is a challenging task due to the absence of class labels for guiding the search for relevant features. The methods proposed for addressing this problem are mostly focused on numerical data. In this work, we propose an approach to selecting categorical features in clustering. We assume that the data comes from a finite mixture of multinomial distributions and implement a new expectation-maximization (EM) algorithm that estimates the parameters of the model and selects the relevant variables. The results obtained on synthetic data clearly illustrate the capability of the proposed approach to select the relevant features.
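For orientation, a bare-bones EM for a finite mixture of multinomials is sketched below; it omits the paper's feature-selection step, and the iteration count, initialization and smoothing constant are illustrative choices:

```python
import numpy as np

def em_multinomial_mixture(X, k, n_iter=100, seed=0):
    """Basic EM for a mixture of k multinomials over count data X
    (n_samples x n_features). Estimates mixture weights `pi` and
    per-component category probabilities `theta`; returns the final
    responsibilities `r` as soft cluster assignments."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = rng.dirichlet(np.ones(d), size=k)       # k x d, rows sum to 1
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: unnormalized log-responsibilities, then normalize
        log_r = np.log(pi) + X @ np.log(theta).T    # n x k
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: reestimate weights and category probabilities
        pi = r.mean(axis=0)
        theta = r.T @ X + 1e-9                      # smooth to avoid zeros
        theta /= theta.sum(axis=1, keepdims=True)
    return pi, theta, r
```

The paper's contribution layers a variable-selection criterion on top of this kind of loop, deciding per feature whether it is informative for the mixture.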

Paper Nr: 89
Title:

A LEARNING METHOD FOR IMBALANCED DATA SETS

Authors:

Jorge de la Calleja, Olac Fuentes, Jesús González and Rita M. Aceves-Pérez

Abstract: Many real-world domains present the problem of imbalanced data sets, where examples of one class significantly outnumber examples of other classes. This situation makes learning difficult, as learning algorithms based on optimizing accuracy over all training examples will tend to classify all examples as belonging to the majority class. In this paper we introduce a method for learning from imbalanced data sets which is composed of three algorithms. Our experimental results show that our method performs accurate classification in the presence of significant class imbalance and using small training sets.
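As a point of reference (a generic baseline, not the paper's three-algorithm method), the simplest remedy for class imbalance is random oversampling of the minority class until the classes are equally represented:

```python
import random

def random_oversample(X, y, seed=0):
    """Balance a dataset by duplicating randomly chosen examples of the
    smaller classes until every class matches the majority class size."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        X_out.extend(rows)
        y_out.extend([label] * len(rows))
        extra = target - len(rows)
        X_out.extend(rng.choice(rows) for _ in range(extra))
        y_out.extend([label] * extra)
    return X_out, y_out
```

More sophisticated approaches interpolate between minority examples or reweight the loss rather than duplicating rows verbatim.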

Paper Nr: 92
Title:

MULTIDIMENSIONAL INFORMATION VISUALIZATION TECHNIQUES - Evaluating a Taxonomy of Tasks with the Participation of the Users

Authors:

Eliane Regina de Almeida Valiati, Josué Toebe, Antonio Flávio Gomes, Milene Andréa Guadagnin, Leandro Luis Bianchi and João Roberto Telles

Abstract: Multidimensional information visualization techniques have the potential to assist in the analysis and understanding of large volumes of data by revealing patterns, clusters and trends that are not obvious when using non-graphical forms of presentation. When developing a visualization technique, the analytic and exploratory tasks that a user might need or want to perform on the data should guide the choice of the visual and interaction metaphors implemented by the technique. Usability tests of visualization techniques also need a clear definition of the user's tasks. The identification and understanding of these tasks is a matter of recent research in the information visualization area, and some works have proposed taxonomies to organize them. This paper describes an experimental evaluation of such a classification, based on the observation of users with different profiles performing exploratory tasks on data using multidimensional visualization techniques.

Paper Nr: 95
Title:

ONTOLOGICAL WAREHOUSING ON SEMANTICALLY INDEXED DATA - Reusing Semantic Search Engine Ontologies to Develop Multidimensional Schemas

Authors:

Filippo Sciarrone and Paolo Starace

Abstract: In this article we present a first experiment with a Business Intelligence solution that dynamically develops multidimensional OLAP schemas by reusing ontologies stored in concept and relation dictionaries and used by semantic indexing engines. The distinctive aspect of the proposed solution is the integration of ontology-based semantic indexing techniques for non-structured documents with dynamic management techniques for unbalanced hierarchies in a Data Warehouse. As a case study, we embedded our solution into a real system built for the analysis and management of experts' curricula in an e-government environment. We show how it is possible to automatically build OLAP dimensions, inheriting the hierarchic structure of the ontologies, with the goal of using the semantically indexed data to carry out multidimensional OLAP analyses. The first experimental results are encouraging.

Paper Nr: 101
Title:

CHARACTERIZING THE TRAFFIC DENSITY AND ITS EVOLUTION THROUGH MOVING OBJECT TRAJECTORIES

Authors:

Ahmed Kharrat, Karine Zeitouni, Iulian Sandu-Popa and Sami Faiz

Abstract: Managing and mining data derived from moving objects has become an important issue in recent years. In this paper, we are interested in mining trajectories of moving objects such as vehicles in the road network. We propose a method for discovering dense routes by clustering similar road sections according to both traffic and location in each time period. The traffic estimation is based on the collected spatiotemporal trajectories. We also propose an approach for characterizing the temporal evolution of dense routes, using a graph that connects dense routes over consecutive time periods and whose edges are labeled by a degree of evolution. We have implemented and tested the proposed algorithms, which have shown their effectiveness and efficiency.

Paper Nr: 125
Title:

THE COLLABORATIVE LEARNING AGENT (CLA) IN TRIDENT WARRIOR 08 EXERCISE

Authors:

Charles Zhou, Ying Zhao and Chetan Kotak

Abstract: The Collaborative Learning Agent (CLA) technology is designed to learn patterns from historical Maritime Domain Awareness (MDA) data and then use the patterns to identify and validate anomalies and to determine the reasons behind them. For example, when a ship is found to be speeding up or slowing down using a traditional sensor-based movement information system such as Automatic Identification System (AIS) data, adding the CLA might make it possible to link the ship or its current position to contextual patterns in the news, such as an unusual amount of commercial activity; typical weather, terrain and environmental conditions in the region; or areas of interest associated with maritime incidents, casualties or military exercises. These patterns can help cross-validate warnings and reduce false alarms from other sensor-based detections.