KDIR 2014 Abstracts


Full Papers
Paper Nr: 7
Title:

Episode Rules Mining Algorithm for Distant Event Prediction

Authors:

Lina Fahed, Armelle Brun and Anne Boyer

Abstract: This paper focuses on event prediction in an event sequence, where we aim at predicting distant events. We propose an algorithm that mines episode rules which are minimal and have a consequent temporally distant from the antecedent. As traditional algorithms are not able to directly mine rules with such characteristics, we propose an original way to mine these rules. Our algorithm, which has a complexity similar to that of state-of-the-art algorithms, determines the consequent of an episode rule at an early stage in the mining process, applies a span constraint on the antecedent, and applies a gap constraint between the antecedent and the consequent. A new confidence measure, the temporal confidence, is proposed, which evaluates the confidence of a rule in relation to the predefined gap. The algorithm is validated on an event sequence of social network messages. We show that minimal rules with a distant consequent are actually formed and that they can be used to accurately predict distant events.
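As a toy illustration of the kind of measure the abstract describes, the sketch below computes a gap-constrained confidence over a timestamped event sequence. The function name, the event encoding, and the exact definition are illustrative assumptions, not the paper's.

```python
def temporal_confidence(sequence, antecedent, consequent, gap):
    """sequence: list of (timestamp, event) pairs. For each occurrence of
    the antecedent event, check whether the consequent occurs at least
    `gap` time units later; return the fraction of occurrences for which
    it does. An illustrative reading of a gap-constrained confidence, not
    the paper's exact definition."""
    hits = total = 0
    for t, e in sequence:
        if e == antecedent:
            total += 1
            if any(e2 == consequent and t2 - t >= gap for t2, e2 in sequence):
                hits += 1
    return hits / total if total else 0.0
```

A larger gap demands a more distant consequent, so the confidence can only stay equal or drop as the gap grows.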

Paper Nr: 10
Title:

URL-based Web Page Classification - A New Method for URL-based Web Page Classification Using n-Gram Language Models

Authors:

Tarek Amr Abdallah and Beatriz de la Iglesia

Abstract: This paper is concerned with the classification of web pages using their Uniform Resource Locators (URLs) only. There are a number of contexts these days in which it is important to have an efficient and reliable classification of a web page from its URL, without the need to visit the page itself. For example, emails or messages sent in social media may contain URLs and require automatic classification. The URL is very concise and may be composed of concatenated words, so classification with only this information is a very challenging task. Much of the current research on URL-based classification has achieved reasonable accuracy, but the current methods do not scale very well to large datasets. In this paper, we propose a new solution based on the use of an n-gram language model. Our solution shows good classification performance and is scalable to larger datasets. It also allows us to tackle the problem of classifying new URLs with unseen sub-sequences.
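A minimal sketch of the general idea: one character n-gram language model per class, with add-one smoothing, scoring a URL by log-likelihood. The class structure, padding symbols, and smoothing choice here are assumptions for illustration; the paper's models may well differ.

```python
from collections import Counter, defaultdict
import math

def ngrams(s, n=3):
    """Character n-grams of a string, padded with start/end markers."""
    s = f"^{s}$"
    return [s[i:i + n] for i in range(len(s) - n + 1)]

class UrlNgramClassifier:
    """Per-class character n-gram language model with add-one smoothing."""
    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # class -> n-gram counts
        self.totals = Counter()             # class -> total n-grams seen
        self.vocab = set()

    def fit(self, urls, labels):
        for url, y in zip(urls, labels):
            grams = ngrams(url, self.n)
            self.counts[y].update(grams)
            self.totals[y] += len(grams)
            self.vocab.update(grams)

    def predict(self, url):
        V = len(self.vocab) + 1
        def log_likelihood(y):
            return sum(math.log((self.counts[y][g] + 1) / (self.totals[y] + V))
                       for g in ngrams(url, self.n))
        return max(self.counts, key=log_likelihood)
```

Because scoring works at the character level, a URL with an unseen concatenated word still shares many character n-grams with training URLs of its class.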

Paper Nr: 14
Title:

Incorporating Guest Preferences into Collaborative Filtering for Hotel Recommendation

Authors:

Fumiyo Fukumoto, Hiroki Sugiyama, Yoshimi Suzuki and Suguru Matsuyoshi

Abstract: Collaborative filtering (CF) has been widely used as a filtering technique because it does not require complicated content analysis. However, it is difficult to take users’ preferences/criteria related to the aspects of a product/hotel into account. This paper presents a method of hotel recommendation that incorporates different aspects of a product/hotel to improve the quality of the scores. We used the results of aspect-based sentiment analysis to capture guest preferences. The empirical evaluation using Rakuten Japanese travel data showed that aspect-based sentiment analysis improves overall performance. Moreover, we found that it is effective for finding hotels that have never been stayed at but are in the same neighborhoods.

Paper Nr: 24
Title:

Cross-domain Text Classification through Iterative Refining of Target Categories Representations

Authors:

Giacomo Domeniconi, Gianluca Moro, Roberto Pasolini and Claudio Sartori

Abstract: Cross-domain text classification deals with predicting topic labels for documents in a target domain by leveraging knowledge from pre-labeled documents in a source domain, with different terms or different distributions thereof. Methods exist that address this problem by re-weighting documents from the source domain to transfer them to the target one, or by finding a common feature space for documents of both domains; they often require the combination of complex techniques, leading to a number of parameters which must be tuned for each dataset to yield optimal performance. We present a simpler method based on creating explicit representations of topic categories, which can be compared for similarity to those of documents. Category representations are initially built from relevant source documents and then iteratively refined by considering the most similar target documents, with relatedness measured by a simple regression model based on cosine similarity, built once at the beginning. This is expected to yield accurate representations of categories in the target domain, which are used to classify documents therein. Experiments on common benchmark text collections show that this approach obtains results better than or comparable to those of other methods, with fixed empirical values for its few parameters.
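The refinement loop the abstract describes can be sketched as follows, using raw cosine similarity in place of the paper's regression model; document vectors, iteration count, and selection size are illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return 0.0 if na == 0 or nb == 0 else float(a @ b) / (na * nb)

def refine_category(source_docs, target_docs, iters=5, top_k=2):
    """Start from the centroid of labeled source documents, then
    repeatedly pull in the most similar target documents and recompute
    the category representation from them. Plain cosine stands in for
    the paper's cosine-based regression model."""
    rep = np.mean(source_docs, axis=0)
    for _ in range(iters):
        sims = [cosine(rep, d) for d in target_docs]
        chosen = np.argsort(sims)[-top_k:]
        rep = np.mean([target_docs[i] for i in chosen], axis=0)
    return rep
```

After a few iterations the representation is built from target-domain documents, so it reflects the target vocabulary rather than the source one.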

Paper Nr: 27
Title:

Arbitrary Shape Cluster Summarization with Gaussian Mixture Model

Authors:

Elnaz Bigdeli, Mahdi Mohammadi, Bijan Raahemi and Stan Matwin

Abstract: One of the main concerns in the area of arbitrary-shape clustering is how to summarize clusters. The most accurate representation of a cluster with arbitrary shape is to characterize it with all its members. However, this approach is neither practical nor efficient. In many applications, such as stream data mining, preserving all samples for a long period of time in the presence of thousands of incoming samples is not practical. Moreover, in the absence of labelled data, clusters are representative of each class, and in the case of arbitrary-shape clusters, finding the closest cluster to a new incoming sample using all objects of the clusters is neither accurate nor efficient. In this paper, we present a new algorithm to summarize arbitrary-shape clusters. Our proposed method, called SGMM, summarizes a cluster using a set of objects as core objects, then represents each cluster with a corresponding Gaussian Mixture Model (GMM). Using the GMM, the closest cluster to a new test sample is identified with low computational cost. We compared the proposed method with ABACUS, a well-known algorithm, in terms of time, space and accuracy for both categorization and summarization purposes. The experimental results confirm that the proposed method outperforms ABACUS on various datasets, including synthetic and real datasets.
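A highly simplified sketch of the summarize-then-assign idea: each cluster keeps a handful of core objects plus a variance, and a new sample is assigned to the cluster whose mixture density at that point is highest. Evenly-spaced core selection and a shared spherical variance are crude stand-ins for SGMM's actual procedure.

```python
import numpy as np

def summarize_cluster(points, n_cores=3):
    """Keep a few members as core objects (here: evenly spaced indices,
    a simplification) and a shared spherical variance."""
    idx = np.linspace(0, len(points) - 1, n_cores).astype(int)
    cores = points[idx]
    var = points.var() + 1e-6
    return cores, var

def mixture_density(x, cores, var):
    """Equal-weight mixture of spherical Gaussians centred on the cores
    (normalizing constants omitted, as only the argmax matters)."""
    d = ((cores - x) ** 2).sum(axis=1)
    return np.exp(-d / (2 * var)).mean()

def closest_cluster(x, summaries):
    """summaries: dict mapping cluster name -> (cores, var)."""
    return max(summaries, key=lambda c: mixture_density(x, *summaries[c]))
```

The point of the summary is that assignment now costs a few density evaluations per cluster instead of a scan over every stored member.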

Paper Nr: 28
Title:

A Media Tracking and News Recommendation System

Authors:

Servet Tasci and Ilyas Cicekli

Abstract: Nowadays, the number of documents on Internet resources is increasing at an unprecedented speed, and users struggle to find the important and relevant ones among this enormous volume of documents. Users require personalized support in sifting through large amounts of available information according to their interests, and recommendation systems try to answer this need. In this context, it is crucial to offer user-friendly tools that facilitate faster and more accurate access to articles in digital newspapers. In this paper, a time-based recommendation system for the news domain is presented. News articles are recommended according to dynamic and static user profiles. Dynamic user profiles reflect a user's past interests, with recent interests playing a much bigger role in the selection of recommendations. Our recommendation system is a complete content-based recommendation system, together with categorization, summarization and news collection modules.

Paper Nr: 31
Title:

Incorporating Ad Hoc Phrases in LSI Queries

Authors:

Roger Bradford

Abstract: Latent semantic indexing (LSI) is a well-established technique for information retrieval and data mining. The technique has been incorporated into a wide variety of practical applications. In these applications, LSI provides a number of valuable capabilities for information search, categorization, clustering, and discovery. However, there are some limitations that are encountered in using the technique. One such limitation is that the classical implementation of LSI does not provide a flexible mechanism for dealing with phrases. In both information retrieval and data mining applications, phrases can have significant value in specifying user information needs. In the classical implementation of LSI, the only way that a phrase can be used in a query is if that phrase has been identified a priori and treated as a unit during the process of creating the LSI index. This requirement has greatly hindered the use of phrases in LSI applications. This paper presents a method for dealing with phrases in LSI-based information systems on an ad hoc basis – at query time, without requiring any prior knowledge of the phrases of interest. The approach is fast enough to be used during real-time query execution. This new capability can enhance use of LSI in both information retrieval and knowledge discovery applications.
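One common way to handle a phrase at query time without re-indexing is to fold it in as a pseudo-document, e.g. the centroid of its constituent term vectors in the reduced space. The sketch below uses that centroid approximation; the paper's actual method may differ.

```python
import numpy as np

def lsi_index(term_doc, k=2):
    """Truncated SVD of a term-document matrix: rows of U*S are term
    vectors, rows of the returned doc matrix are document vectors, both
    in the k-dimensional latent space."""
    U, S, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k] * S[:k], Vt[:k].T

def phrase_vector(term_vecs, term_ids):
    """Ad hoc phrase at query time: represent the phrase as the centroid
    of its constituent term vectors (one standard folding-in style
    approximation; illustrative only)."""
    return np.mean([term_vecs[i] for i in term_ids], axis=0)
```

Since the phrase vector is built from existing term vectors, no prior identification of the phrase during index construction is needed, which is the capability the abstract highlights.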

Paper Nr: 35
Title:

The Social Score - Determining the Relative Importance of Webpages Based on Online Social Signals

Authors:

Marco Buijs and Marco Spruit

Abstract: There are many ways to determine the importance of Webpages, the most successful one being the PageRank algorithm. In this paper we describe an alternative ranking method that we call the Social Score method. The Social Score of a Webpage is based on the number of likes, tweets, bookmarks and other such signals from Social Media platforms. By determining the importance of Webpages based on this kind of information, ranking is based on a democratic system instead of a system in which only web authors influence the ranking of results. Based on an experiment, we conclude that the Social Score is a strong alternative to PageRank that could be used as an additional property to take into account in Web search engines.
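In its simplest form, such a score is a weighted aggregation of per-platform counts. The weights below are purely illustrative assumptions, not the paper's.

```python
def social_score(signals, weights=None):
    """Weighted sum of social signals for one Webpage.
    signals: dict like {"likes": 10, "tweets": 2, "bookmarks": 1}.
    The default weights are hypothetical, chosen only to show the shape
    of the computation."""
    weights = weights or {"likes": 1.0, "tweets": 1.5, "bookmarks": 2.0}
    return sum(weights.get(k, 1.0) * v for k, v in signals.items())
```

Pages can then be ranked by this score alone or blended with PageRank as an additional feature.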

Paper Nr: 36
Title:

A New Multi-lingual Knowledge-base Approach to Keyphrase Extraction for the Italian Language

Authors:

Dante Degl'Innocenti, Dario De Nart and Carlo Tasso

Abstract: Associating meaningful keyphrases with text documents and Web pages is an activity that can significantly increase the accuracy of Information Retrieval, Personalization and Recommender systems, but the growing amount of available text data is too large for extensive manual annotation. On the other hand, automatic keyphrase generation can significantly support this activity. This task is already performed with satisfactory results by several systems proposed in the literature; however, most of them focus solely on the English language, which represents more than 50% of Web content. Only a few other languages have been investigated, and Italian, despite being the ninth most used language on the Web, is not among them. In order to fill this gap, we propose a novel multi-language, unsupervised, knowledge-based approach to keyphrase generation. To support our claims, we developed DIKpE-G, a prototype system which integrates several kinds of knowledge for selecting and evaluating meaningful keyphrases, ranging from linguistic to statistical, meta/structural, social, and ontological knowledge. DIKpE-G performs well over English and Italian texts.

Paper Nr: 39
Title:

Symmetry Breaking in Itemset Mining

Authors:

Belaïd Benhamou, Saïd Jabbour, Lakhdar Sais and Yakoub Salhi

Abstract: The concept of symmetry has been extensively studied in the fields of constraint programming and propositional satisfiability. Several methods for the detection and removal of these symmetries have been developed, and their integration into known solvers in these domains has dramatically improved their effectiveness on a large variety of problems considered difficult to solve. The concept of symmetry can be exported to other domains where some structures can be exploited effectively, particularly to data mining, where some tasks can be expressed as constraints. In this paper, we are interested in the detection and elimination of symmetries in the problem of finding frequent itemsets of a transaction database and its variants. Recent works have provided effective encodings of these data mining tasks as Boolean constraints, and some recent works on symmetry detection and elimination in itemset mining problems have been proposed. In this work we propose a generic framework that can be used to eliminate symmetries for data mining tasks expressed in a declarative constraint language. We show how symmetries between the items of the transactions are detected and eliminated by adding symmetry-breaking predicates (SBPs) to the Boolean encoding of the data mining task.

Paper Nr: 41
Title:

Enhancing Online Discussion Forums with a Topic-driven Navigational Paradigm - A Plugin for the Moodle Learning Management System

Authors:

Damiano Distante, Luigi Cerulo, Aaron Visaggio and Marco Leone

Abstract: Online discussion forums are among the most popular means of asynchronous communication and the richest repositories of user-generated information on the Internet. The capability of a forum to satisfy users’ needs as an information source is mainly determined by its richness in information, but also by the way its content (messages and message threads) is organized and made navigable and searchable. To ease content navigation and information search in online discussion forums, we propose an approach that introduces a complementary navigation structure which enables searching and navigating forum content by topic of discussion, thus enabling a topic-driven navigational paradigm. Discussion topics and the hierarchical relations between them are extracted from the forum's textual content with a semi-automatic process, by applying Information Retrieval techniques, specifically Topic Models and Formal Concept Analysis. Then, forum messages and discussion threads are associated with discussion topics on the basis of a similarity score. In this paper we present an implementation of our approach for the Moodle learning management system, opening the approach to application in several real e-learning contexts. We also show with a case study that the new topic-driven navigation structure improves information search tasks with respect to using Moodle's standard full-text search.

Paper Nr: 55
Title:

Discovering New Gene Functionalities from Random Perturbations of Known Gene Ontological Annotations

Authors:

Giacomo Domeniconi, Marco Masseroli, Gianluca Moro and Pietro Pinoli

Abstract: Genomic annotations describing functional features of genes and proteins through controlled terminologies and ontologies are extremely valuable, especially for computational analyses aimed at inferring new biomedical knowledge. Thanks to the biology revolution led by the introduction of novel DNA sequencing technologies, several repositories of such annotations have become available in the last decade; among them, the ones including Gene Ontology annotations are the most relevant. Nevertheless, the available set of genomic annotations is incomplete, and only some of the available annotations represent highly reliable, human-curated information. In this paper we propose a novel representation of the annotation discovery problem that enables applying supervised algorithms to predict Gene Ontology annotations of different organisms' genes. In order to use supervised algorithms even though labeled data to train the prediction model are not available, we propose a random perturbation method for the training set, which creates a new annotation matrix used to train the model to recognize new annotations. We tested the effectiveness of our approach on nine Gene Ontology annotation datasets. The obtained results demonstrate that our technique is able to improve novel annotation predictions with respect to state-of-the-art unsupervised methods.

Paper Nr: 64
Title:

Towards Analytical MD Stars from Linked Data

Authors:

Victoria Nebot and Rafael Berlanga

Abstract: While the Linked Data (LD) initiative has given rise to open, large amounts of semi-structured and rich data published on the Web, effective analytical tools that go beyond browsing and querying are still lacking. To address this issue, we propose the automatic generation of multidimensional (MD) analytical stars. The success of the MD model for data analysis has been in great part due to its simplicity. Therefore, in this paper we aim at automatically discovering MD conceptual patterns that summarize LD. These patterns resemble the MD star schema typical of relational data warehousing. Our method is based on probabilistic graphical models and makes use of statistics about the instance data to generate the MD stars. We present a first implementation, and the preliminary results with large LD sets encourage further work in this direction.

Paper Nr: 82
Title:

Sentiment Polarity Extension for Context-Sensitive Recommender Systems

Authors:

Octavian Lucian Hasna, Florin Cristian Macicasan, Mihaela Dinsoreanu and Rodica Potolea

Abstract: Opinion mining has become an important field of text mining. A key limitation of supervised learning is domain dependence: a solution is highly dependent on a given dataset (or at least a specific domain), if not specifically designed or tuned for it. Our method is an attempt to overcome such limitations by considering the generic characteristics hidden in textual information. We aim to identify the sentiment polarity of documents belonging to different domains with the help of a uniform, cross-domain representation. It relies on three classes of original meta-features that can be used to characterize datasets belonging to various domains. We evaluate our approach using three datasets used extensively in the literature. The results for in-domain and cross-domain verification show that the proposed approach handles novel domains increasingly better as its training corpus grows, thus inducing domain independence.

Short Papers
Paper Nr: 3
Title:

Performance Evaluation of State-of-the-Art Ranked Retrieval Methods and Their Combinations for Query Suggestion

Authors:

Suthira Plansangket and John Q. Gan

Abstract: This paper investigates several state-of-the-art ranked retrieval methods, and adapts and combines them for query suggestion. Four performance criteria, plus user evaluation, have been adopted to evaluate these query suggestion methods in terms of ranking and relevance from different perspectives. Extensive experiments have been conducted using eighty carefully designed test queries related to eight topics. The experimental results show that the method developed in this paper, which combines the TF-IDF and Jaccard coefficient methods, is the best of the six methods evaluated for query suggestion, outperforming the most widely used TF-IDF method. Furthermore, it is shown that re-ranking query suggestions using cosine similarity improves the performance of query suggestions.
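A combination of TF-IDF and the Jaccard coefficient can be sketched as a linear blend of the two scores. The particular TF-IDF weighting scheme and the mixing parameter alpha are illustrative assumptions; the paper's combination may be defined differently.

```python
import math

def tf_idf_score(query_terms, cand_terms, docs):
    """Sum of (1 + log tf) * log(N/df) over query terms present in the
    candidate; one standard TF-IDF weighting, used here for illustration."""
    N = len(docs)
    score = 0.0
    for t in query_terms:
        tf = cand_terms.count(t)
        df = sum(1 for d in docs if t in d)
        if tf and df:
            score += (1 + math.log(tf)) * math.log(N / df)
    return score

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def combined_score(query, candidate, docs, alpha=0.5):
    """Linear combination of TF-IDF and Jaccard scores for a suggestion."""
    return (alpha * tf_idf_score(query, candidate, docs)
            + (1 - alpha) * jaccard(query, candidate))
```

Candidate suggestions are then ranked by the combined score, and the top-ranked ones shown to the user.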

Paper Nr: 15
Title:

Finding Optimal Exact Reducts

Authors:

Hassan AbouEisha

Abstract: The problem of attribute reduction is an important problem related to feature selection and knowledge discovery. The problem of finding reducts with minimum cardinality is NP-hard. This paper suggests a new algorithm for finding exact reducts with minimum cardinality. This algorithm transforms the initial table into a decision table of a special kind, applies a set of simplification steps to this table, and uses a dynamic programming algorithm to finish the construction of an optimal reduct. I present results of computer experiments for a collection of decision tables from the UCI ML Repository. For many of the tables tested, the simplification steps solved the problem.

Paper Nr: 17
Title:

Semantic Coherence-based User Profile Modeling in the Recommender Systems Context

Authors:

Roberto Saia, Ludovico Boratto and Salvatore Carta

Abstract: Recommender systems usually produce their results based on the interpretation of the whole history of user interactions. This canonical approach can sometimes lead to wrong results due to several factors, such as changes in user taste over time or the use of an account by third parties. This work proposes a novel dynamic coherence-based approach that analyzes the information stored in user profiles based on its coherence. The main aim is to identify and remove, from the previously evaluated items, those not adherent to the average preferences, in order to make a user profile as close as possible to the user's real tastes. The conducted experiments show the effectiveness of our approach in removing incoherent items from a user profile, increasing recommendation accuracy.

Paper Nr: 25
Title:

SoC Processor Discovery for Program Execution Matching Using Unsupervised Machine Learning

Authors:

Avi Bleiweiss

Abstract: The fast cadence of evolving mobile compute systems often extends their default processor configuration by incorporating task-specific companion cores. In this setting, the problem of matching a compute program to efficiently execute on a dynamically selected processor poses a considerable challenge to employing traditional compiler technology. Instead, we propose an unsupervised machine learning methodology that mines a large data corpus of unlabeled compute programs, with the objective of discovering optimal program-processor relations. In our work, we regard a compute program as a text document, comprised of a linear sequence of bytecode mnemonics, and further transformed into an effective representation as a bag of instruction term frequencies. A set of concise instruction vectors is then forwarded to a finite mixture model to identify unsolicited cluster patterns of source-target compute pairings, using the expectation-maximization algorithm. For classification, we explore k-nearest neighbor and ranked information retrieval methods, and evaluate our system by simultaneously varying the dimensionality of the training set and the SoC processor formation. We report robust performance results on both the discovery of relational clusters and feature matching.

Paper Nr: 26
Title:

Actors and Factors in IS Process Innovation Decisions

Authors:

Erja Mustonen-Ollila, Jukka Heikkonen and Philip Powell

Abstract: Information system process innovation (ISPI) describes new ways of developing, implementing, and maintaining information systems. This paper investigates ISPI decisions in three organisations over four development generations. The analysis reveals dependencies between the actors and factors in the decision processes; it shows how the actors employ different combinations of factors, and how the factors influence the actors’ decision making. Self-Organizing Map clustering demonstrates that, in the three organisations, the combinations of ISPIs and actors vary over time, and these variations may be partly explained by power dependency between the organisations. The dependencies identified here are novel. The actors and factors found in past research are validated, and the dependencies between the actors and factors enhance confidence in the validity of the concepts and dependencies, as well as contributing to expanding and emerging theory.

Paper Nr: 29
Title:

Violence Recognition in Spanish Words using Data Mining

Authors:

Adolfo Flores Moreno, Silvia B. González-Brambila and Juan G. Vargas-Rubio

Abstract: Violent behavior in our society has been studied from many points of view, yet many cause-effect relations remain unexplained. Security personnel are normally trained to be alert and recognize potentially violent behavior, but they cannot be 100% effective in recognizing it due to the monotonous nature of their job. This paper presents the first results of a work in progress on detecting violence from the analysis of words in conversations. We used a set of videos with two-person conversations in Spanish and classified them as violent and non-violent. The audio of the conversations was extracted and converted to text. We used “Ward”, “K-means” and “PAM” (clValid, 2014) to group words; performing a clValid analysis, we found that the hierarchical technique was the best. The frequency percentages were computed for each term and the SVM (Meyer, 2014) technique was applied, from which we found that there were unclassifiable terms. In three of the tests the prediction was erroneous, and in another three we obtained good predictions with respect to the test set.

Paper Nr: 30
Title:

NoSQL Graph-based OLAP Analysis

Authors:

Arnaud Castelltort and Anne Laurent

Abstract: OLAP is a leading technology for analysing data and decision making. It helps users discover relevant information in large databases. Graph OLAP has been studied for several years within the OLAP framework. In existing work, the authors study how to import graph data into OLAP cube models, but no work has yet explored the feasibility of exploiting graph structures to store analytical data. As graph databases are more and more used through NoSQL implementations (e.g., social and biological networks), in this paper we aim at providing an original model for managing cubes in NoSQL graphs. We show how cubes can be represented in graphs and how these structures can then be used for graph OLAP queries to support decision making.

Paper Nr: 34
Title:

Comparing Methods for Twitter Sentiment Analysis

Authors:

Evangelos Psomakelis, Konstantinos Tserpes, Dimosthenis Anagnostopoulos and Theodora Varvarigou

Abstract: This work extends the set of works which deal with the popular problem of sentiment analysis in Twitter. It investigates the most popular document ("tweet") representation methods which feed sentiment evaluation mechanisms. In particular, we study the bag-of-words, n-grams and n-gram graphs approaches and for each of them we evaluate the performance of a lexicon-based and 7 learning-based classification algorithms (namely SVM, Naïve Bayesian Networks, Logistic Regression, Multilayer Perceptrons, Best-First Trees, Functional Trees and C4.5) as well as their combinations, using a set of 4451 manually annotated tweets. The results demonstrate the superiority of learning-based methods and in particular of n-gram graphs approaches for predicting the sentiment of tweets. They also show that the combinatory approach has impressive effects on n-grams, raising the confidence up to 83.15% on the 5-Grams, using majority vote and a balanced dataset (equal number of positive, negative and neutral tweets for training). In the n-gram graph cases the improvement was small to none, reaching 94.52% on the 4-gram graphs, using Orthodromic distance and a threshold of 0.001.

Paper Nr: 46
Title:

Text Analysis of User-Generated Contents for Health-care Applications - Case Study on Smoking Status Classification

Authors:

Deema Abdal Hafeth, Amr Ahmed and David Cobham

Abstract: Text mining techniques have demonstrated a potential to unlock significant patient health information from unstructured text. However, most of the published work has been done using clinical reports, which are difficult to access due to patient confidentiality. In this paper, we present an investigation of text analysis for smoking status classification from User-Generated Content (UGC), such as online forum discussions. UGC is more widely available compared to clinical reports. Based on an analysis of the properties of UGC, we propose the use of Linguistic Inquiry and Word Count (LIWC), an approach being used for the first time for such a health-related task. We also explore various factors that affect the classification performance. The experimental results and evaluation indicate that the forum classification performs well with the proposed features. It has achieved an accuracy of up to 75% for smoking status prediction. Furthermore, the feature set used is compact (88 features only) and independent of the dataset size.

Paper Nr: 47
Title:

Exploiting Social Debates for Opinion Ranking

Authors:

Youssef Meguebli, Mouna Kacimi, Bich-liên Doan and Fabrice Popineau

Abstract: The number of opinions in news media platforms is increasing dramatically, with daily news hits and people spending more and more time discussing topics and sharing experiences. Such user-generated content represents a promising source for improving the effectiveness of news article recommendation and retrieval. However, the corpus of opinions is often large and noisy, making it hard to find prominent content. In this paper, we tackle this problem by proposing a novel scoring model that ranks opinions based on their relevance and prominence. We define the prominence of an opinion using its relationships with other opinions. To this end, we (1) create a directed graph of opinions where each link represents the sentiment an opinion expresses about another opinion, and (2) propose a new variation of the PageRank algorithm that boosts the scores of opinions along links with positive sentiments and decreases them along links with negative sentiments. We have tested the effectiveness of our model through extensive experiments using three datasets crawled from the CNN, Independent, and The Telegraph Web sites. The experiments show that our scoring model achieves high-quality results.
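A sign-aware PageRank variant along these lines can be sketched as below. The update rule, the clipping of negative scores to zero, and the damping factor are assumptions made for illustration, not the paper's exact formulation.

```python
def signed_pagerank(links, d=0.85, iters=50):
    """links: dict node -> list of (target, sign) with sign in {+1, -1}.
    Positive links pass score to the target; negative links subtract it.
    Scores are clipped at zero each round, a simplification of the idea
    the abstract describes."""
    nodes = set(links) | {t for outs in links.values() for t, _ in outs}
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - d) / len(nodes) for n in nodes}
        for n, outs in links.items():
            if not outs:
                continue
            share = score[n] / len(outs)
            for t, sign in outs:
                new[t] += d * sign * share
        score = {n: max(v, 0.0) for n, v in new.items()}
    return score
```

An opinion criticized by its neighbours thus loses score where vanilla PageRank would still reward it for being linked to.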

Paper Nr: 49
Title:

Disease Identification in Electronic Health Records - An Ontology based Approach

Authors:

Ioana Barbantan, Camelia Lemnaru and Rodica Potolea

Abstract: Efficiently exploiting medical data from Electronic Health Records (EHRs) is a current joint research focus of the knowledge extraction and medical communities. EHR structuring is essential for the efficient exploitation of the information they capture. To that end, concept identification and categorization represent key tasks. This paper presents a disease identification approach which applies several NLP document pre-processing steps, queries the SNOMED-CT ontology, and then applies a filtering rule on the retrieved information. The hierarchical approach provides better filtering of the concepts, reducing the number of falsely identified disease concepts. We have performed a series of evaluations on the Medline abstracts dataset. The results obtained so far are promising: our method achieves a precision of 87.79% and a recall of 87.12%, better than the results obtained by Apache's cTAKES system on the same task and dataset.
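The hierarchical filtering step can be pictured as walking a concept's is-a chain upward and keeping it only if it falls under a disease subtree. The toy parent map below stands in for SNOMED-CT; the root label and the traversal are illustrative assumptions, not the paper's rule.

```python
def is_disease(concept, parents, root="disorder"):
    """Walk a toy is-a hierarchy upward and report whether the concept
    lies under the disorder subtree. `parents` maps concept -> parent;
    a stand-in for real SNOMED-CT ancestor queries."""
    seen = set()
    while concept is not None and concept not in seen:
        if concept == root:
            return True
        seen.add(concept)          # guard against cycles in the toy map
        concept = parents.get(concept)
    return False
```

Concepts retrieved from the ontology that fail this check (substances, procedures, etc.) are discarded, which is what cuts the falsely identified disease concepts.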

Paper Nr: 52
Title:

A Fusion Approach to Computing Distance for Heterogeneous Data

Authors:

Aalaa Mojahed and Beatriz de la Iglesia

Abstract: In this paper, we introduce heterogeneous data as data about objects that are described by different data types, for example, structured data, text, time series, images etc. We provide an initial definition of a heterogeneous object using some basic data types, namely structured and time series data, and make the definition extensible to allow the introduction of further data types and complexity in our objects. There is currently a lack of methods to analyse and, in particular, to cluster such data. We then propose an intermediate fusion approach to calculate the distance between objects in such datasets. Our approach deals with uncertainty in the distance calculation and provides a representation of it that can later be used to fine-tune clustering algorithms. We provide some initial examples of our approach using a real dataset of prostate cancer patients, including visualisation of both distances and uncertainty. Our approach is a preliminary step in the clustering of such heterogeneous objects, as the distance between objects produced by the fusion approach can be fed to any standard clustering algorithm. Although further experimental evaluation will be required to fully validate the Fused Distance Matrix approach, this paper presents the concept through an example and shows its feasibility. The approach is extensible to other problems with objects represented by different data types, e.g. text or images.
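One illustrative reading of such an intermediate fusion: compute a per-data-type distance for each modality, average the ones that are available, and let the uncertainty grow with the number of missing modalities. The averaging and the uncertainty definition are assumptions for illustration, not the paper's formulas.

```python
def fuse_distances(per_type_dists):
    """per_type_dists: list of normalized [0, 1] distances between two
    objects, one entry per data type, with None where a modality is
    missing for either object. Returns (fused distance, uncertainty)."""
    known = [d for d in per_type_dists if d is not None]
    distance = sum(known) / len(known) if known else 1.0
    uncertainty = 1 - len(known) / len(per_type_dists)
    return distance, uncertainty
```

The fused distances form a matrix that any standard clustering algorithm can consume, while the uncertainty matrix is kept aside for later tuning, mirroring the pipeline the abstract outlines.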

Paper Nr: 54
Title:

Mathematical Foundations of Networks Supporting Cluster Identification

Authors:

Joseph E. Johnson and John William Campbell

Abstract: The author proved that the continuous general linear (Lie) group in n dimensions can be decomposed into (a) a Markov-type Lie group (MTLG) preserving the sum of the components of a vector, and (b) an Abelian Lie scaling group that scales each of the components. For a specific Lie basis, the MTLG generated all continuous Markov transformations (a Lie Markov Monoid, LMM), and subsequently published work proved that every possible network, as defined by an n x n connection matrix Cij of non-negative off-diagonal real numbers, is isomorphic to the set of LMM. As this defined the diagonal of C, it supported full eigenvalue analysis of the generated Markov matrix as well as Renyi entropies whose spectra order the nodes and make comparison of networks possible. Our new research provides (a) a method of expanding a network topology in different orders of Renyi entropies, (b) the construction of a meta-network of all possible networks of use in network classification, (c) the use of eigenvector analysis of the LMM generated by a network C to provide an agnostic methodology for identifying clusters, and (d) a methodology for identifying clusters in general numeric database tables.

Paper Nr: 57
Title:

Fuzzy Ontology-based System for Personalized Information Retrieval

Authors:

Ghada Besbes, Hajer Baazaoui-Zghal and Henda Ben Ghezala

Abstract: The ambiguity of natural language in the user’s query and the corpus documents presents a challenge in the ontology building process. Fuzzy ontologies use fuzzy logic in order to deal with vague information and provide a better comprehension of the user’s needs. In this paper, we propose a fuzzy ontology-based system for personalized information retrieval. The architecture of the proposed system is mainly based on a user profile model. It is composed of three ontological dimensions: history, positive preferences and negative preferences. Our aim is to perform effective information retrieval by extending the system’s understanding of ambiguous concepts and integrating user profiling based on implicit knowledge acquisition in the search process. The improvement of semantic search has been studied in an evaluation process. The obtained results show that adding the fuzzy ontology-based model enables an improvement of query reformulation.

Paper Nr: 63
Title:

Data Analytics for Power Utility Storm Planning

Authors:

Lan Lin, Aldo Dagnino, Derek Doran and Swapna Gokhale

Abstract: As the world population grows, recent climatic changes seem to bring powerful storms to populated areas. The impact of these storms on utility services is devastating. Hurricane Sandy is a recent example of the enormous damage that storms can inflict on infrastructure, society, and the economy. Quick response to these emergencies represents a big challenge for electric power utilities. Traditionally, utilities develop preparedness plans for storm emergency situations based on the experience of utility experts and with limited use of historical data. With the advent of the Smart Grid, utilities are incorporating automation and sensing technologies in their grids and operation systems. This greatly increases the amount of data collected during normal and storm conditions. These data, when complemented with data from weather stations, storm forecasting systems, and online social media, can be used in analyses for enhancing storm preparedness for utilities. This paper presents a data analytics approach that uses real-world historical data to help utilities in storm damage projection. Preliminary results from the analysis are also included.

Paper Nr: 65
Title:

Competency Mining in Large Data Sets - Preparing Large Scale Investigations in Computer Science Education

Authors:

Peter Hubwieser and Andreas Mühling

Abstract: In preparation for large-scale surveys on computer science competencies, we are developing proper competency models and evaluation methodologies, aiming to define competencies by sets of existing questions that test congruent abilities. For this purpose, we have to look for sets of test questions that measure joint psychometric constructs (competencies) according to the responses of the test persons. We have developed a methodology for this goal by applying latent trait analysis to all combinations of questions of a certain test. After identifying suitable sets of questions, we test the fit of the mono-parametric Rasch Model and evaluate the distribution of person parameters. As a test bed for first feasibility studies, we have utilized the large-scale Bebras Contest in Germany 2009. The results show that this methodology works and might one day result in a set of empirically founded competencies in the field of Computational Thinking.

Paper Nr: 67
Title:

A Simple Classification Method for Class Imbalanced Data using the Kernel Mean

Authors:

Yusuke Sato, Kazuyuki Narisawa and Ayumi Shinohara

Abstract: Support vector machines (SVMs) are among the most popular classification algorithms. However, whereas SVMs perform efficiently on class-balanced datasets, their performance declines on class-imbalanced datasets. The fuzzy SVM for class imbalance learning (FSVM-CIL) is a variation of the SVM-type algorithm designed to accommodate class-imbalanced datasets. Considering the class imbalance, FSVM-CIL associates a fuzzy membership with each example, which represents the importance of the example for classification. Based on FSVM-CIL, we present a simple but effective method to calculate fuzzy memberships using the kernel mean. The kernel mean is a useful statistic for consideration of the probability distribution over the feature space. Our proposed method is simpler than preceding methods because it requires adjustment of fewer parameters and operates at reduced computational cost. Experimental results show that our proposed method is promising.

Paper Nr: 68
Title:

Semantic Annotation of UMLS using Conditional Random Fields

Authors:

Shahad Kudama and Rafael Berlanga

Abstract: In this work, we present a first approximation to the semantic annotation of Unified Medical Language System (UMLS®) concept descriptions, based on the extraction of relevant linguistic features and their use in conditional random fields (CRF) to classify them into the different semantic groups provided by UMLS. Experiments have been carried out over the whole set of concepts of UMLS (more than 1 million). The precision scores obtained in the global system evaluation are high, between approximately 70% and 80%, depending on the percentage of semantic information provided as input. Regarding results by semantic group, the precision even reaches 100% in those groups with the highest representation in the selected descriptions of UMLS.

Paper Nr: 71
Title:

Annotating Cohesive Statements of Anatomical Knowledge Toward Semi-automated Information Extraction

Authors:

Kazuo Hara, Ikumi Suzuki, Kousaku Okubo and Isamu Muto

Abstract: Anatomical knowledge written in a textbook is almost completely unreusable computationally, because it is embedded in a cohesive discourse. In discourse contexts, the frequent use of cohesive ties such as reference expressions and coordinated phrases not only hinders automated systems (i.e., natural language parsers) in extracting knowledge from the resulting complicated sentences, but also affects the identification of mentions of anatomical named entities (NEs). We propose to revamp the prose style of anatomical textbooks by transforming cohesive discourse into itemized text, which can be accomplished by annotating reference expressions and coordinating conjunctions. Then, automatically, each anaphor is replaced by its antecedent in each reference expression, and the conjoined elements are distributed over sentences duplicated for each coordinating conjunction connecting phrases. We demonstrate that, compared to the original text, the transformed one is easier for machines to process and hence convenient for identifying mentions of anatomical NEs and their relations. Since the transformed text is human-readable as well, we believe our approach provides a promising new model for language resources accessible by both humans and machines, improving the computational reusability of textbooks.

Paper Nr: 72
Title:

Identifying Drug Repositioning Targets using Text Mining

Authors:

Eduardo Barçante, Milene Jezuz, Felipe Duval, Ernesto Caffarena, Oswaldo G. Cruz and Fabricio Silva

Abstract: The current scenario of computational biology relies on the know-how of many technological areas, with a focus on information, computing, and, particularly, the construction and use of existing Internet databases such as MEDLINE, PubMed and PDB. In recent years, these databases have provided an environment to access, integrate and produce new knowledge by storing ever-increasing volumes of genetic or protein data. Transforming and managing these data in ways different from those originally intended can be a challenge for research in biology. Problems arise from the lack of textual structure or appropriate markup tags. The main goal of this work is to explore the PubMed database, the main source of information about the health sciences, from the National Library of Medicine. By means of this database of digital textual documents, we aim to develop a method capable of identifying protein terms that will serve as a substrate for laboratory practices for repositioning drugs. In this perspective, we use text mining to extract terms related to protein names in the field of neglected diseases.

Paper Nr: 73
Title:

Mining for Adverse Drug Events on Twitter

Authors:

Felipe Duval, Ernesto Caffarena, Oswaldo Cruz and Fabrício Silva

Abstract: In the post-marketing phase, when drugs are used by large populations and for long periods, unexpected adverse events may occur, altering the risk-benefit relation of drugs and sometimes requiring regulatory action. These events at the post-marketing phase demand significantly increased health care attention since they result in unnecessary, often fatal, damage to patients. Therefore, the early discovery of adverse events in the post-marketing phase is a primary goal of the health system, in particular for pharmacovigilance systems. The main purpose of this paper is to prove that Twitter can be used as a source to find new and already known adverse drug events. This proposal has prominent social relevance, as it will help pharmacovigilance systems.

Paper Nr: 74
Title:

The GDR Through the Eyes of the Stasi - Data Mining on the Secret Reports of the State Security Service of the former German Democratic Republic

Authors:

Christoph Kuras, Thomas Efer, Christian Adam and Gerhard Heyer

Abstract: The conjunction of NLP and the humanities has been gaining importance over recent years. As part of this development, more and more historical documents are being digitized and can be used as input for established NLP methods. In this paper we present a corpus of texts from reports of the Ministry of State Security of the former GDR. Although these are written in a distinctive kind of sublanguage, we show that traditional NLP can be applied with satisfying results. We use these results as a basis for providing new ways of presenting and exploring the data, which can then be accessed by a wide spectrum of users.

Paper Nr: 77
Title:

Sentimental Analysis of Web Financial Reviews - Opportunities and Challenges

Authors:

Changxuan Wan, Tengjiao Jiang, Dexi Liu and Guoqiong Liao

Abstract: Web financial reviews are real-time, comprehensive and authentic. The construction and quantification of Web financial indexes based on Web financial reviews is of great significance for financial early warning for enterprises. Compared with product reviews and news commentaries, in Web financial reviews the opinion targets have more diverse compositions, the frequencies of opinion targets’ occurrence vary greatly, and the sentiment words have more diverse parts of speech. These characteristics make the extraction of opinion targets, the construction of Web financial indexes, and opinion target-based sentiment analysis all more complicated, posing new challenges to natural language processing.

Paper Nr: 78
Title:

Arabic Text Classification using Bag-of-Concepts Representation

Authors:

Alaa Alahmadi, Arash Joorabchi and Abdulhussain E. Mahdi

Abstract: With the exponential growth of Arabic text in digital form, the need for efficient organization, navigation and browsing of large amounts of documents in Arabic has increased. Text Classification (TC) is one of the important subfields of data mining. The Bag-of-Words (BOW) representation model, which is the traditional way to represent text for TC, only takes into account the frequency of term occurrence within a document. Therefore, it ignores important semantic relationships between terms and treats synonymous words independently. In order to address this problem, this paper describes the application of a Bag-of-Concepts (BOC) text representation model for Arabic text. The proposed model is based on utilizing the Arabic Wikipedia as a knowledge base for concept detection. The BOC model is used to generate a Vector Space Model, which in turn is fed into a classifier to categorize a collection of Arabic text documents. Two different machine-learning based classifiers have been deployed to evaluate the effectiveness of the proposed model in comparison to the traditional BOW model. The results of our experiment show that the proposed BOC model achieves an improved performance with respect to BOW in terms of classification accuracy.

Paper Nr: 86
Title:

Marble Initiative - Monitoring the Impact of Events on Customers Opinion

Authors:

M. Fernandes Caíña, R. Díaz Redondo and A. Fernández Vilas

Abstract: Social networks have become a major source of information, opinions and sentiments about almost any subject. The purpose of this work is to provide evidence of the applicability of opinion mining methods to find out how some events may impact public opinion about a brand, product or service. We report an experiment that mined Twitter data related to two particular brands during specific periods, selected around events that were expected to affect users’ perception. To draw conclusions, the methodology of the experiment applies several pre-processing techniques to extract sentiment information from the posts (e.g., case alterations, Part-of-Speech tagging using a Natural Language Toolkit, symbol removal, sentence and n-gram separation). The SenticNet 2 corpus is used for polarity classification by means of a supervised algorithm where several threshold values are defined to mark positive, negative and neutral opinions. A longitudinal inspection of the polarized results on histograms allows identifying the "hot spots" and relating them to real-world events. Although this paper shows the findings of our initial experiments, the ultimate goal of the research initiative, which we call Marble, is to provide a cloud solution for the early detection of opinion shifts by the automatic classification of events according to their impact on opinion (propagation speed, intensity and duration), and its relationship with the normal behavior around a brand, product or service.

Paper Nr: 88
Title:

Sarcasm Detection using Sentiment and Semantic Features

Authors:

Prateek Nagwanshi and C. E. Veni Madhavan

Abstract: Sarcasm is a figure of speech used to express a strong opinion in a mild manner. It is often used to convey the opposite sense of what is expressed. Automatic recognition of sarcasm is a complex task. Sarcasm detection is of importance in effective opinion mining. Most sarcasm detectors use lexical and pragmatic features for this purpose. We incorporate statistical as well as linguistic features. Our approach considers the semantic and flipping of sentiment as main features. We use machine learning techniques for classification of sarcastic statements. We conduct experiments on different types of data sets, and compare our results with an existing approach in the literature. We also present human evaluation results. We propose to augment the present encouraging results by a new approach of integrating linguistic and cognitive aspects of text processing.

Paper Nr: 90
Title:

Does a “Renaissance Man” Create Good Wikipedia Articles?

Authors:

Jacek Szejda, Marcin Sydow and Dominika Czerniawska

Abstract: We introduce the concept of diversity of interests, or versatility, of a member of an open-collaboration environment such as Wikipedia, and aim to study how versatility influences work quality. We introduce a versatility measure based on entropy. In preliminary experiments on Wikipedia data, we indicate the positive role of editors’ versatility in the quality of the articles they co-edit.

Paper Nr: 91
Title:

Fuzzy User Profile Modeling for Information Retrieval

Authors:

Rim Fakhfakh, Anis Ben Ammar and Chokri Ben Amar

Abstract: Given the continued growth in the number of documents available on the social Web, it becomes increasingly difficult for a user to find relevant resources satisfying his information need. Personalization seems to be an efficient way to improve retrieval engine effectiveness. In this paper we introduce a personalized image retrieval system based on user profile modeling depending on the user’s context. The context includes user comments, ratings, tags and preferences extracted from social networks. We adopt fuzzy logic-based user profile modeling due to its flexibility in decision making, since user preferences are always imprecise. The user has to specify his initial need description by rating the concepts and contexts he is interested in. Concepts and contexts are weighted by the user by associating a score, and these scores feed into our fuzzy model to predict and return the preference degree related to each concept for a given context. Relying on the score assigned to each concept and context, we deduce its importance and then apply the appropriate fuzzy rule. As for the experiments, the advanced user profile modeling with fuzzy logic shows more flexibility in the interpretation of the query.

Paper Nr: 94
Title:

A Word Association Based Approach for Improving Retrieval Performance from Noisy OCRed Text

Authors:

Anirban Chakraborty, Kripabandhu Ghosh and Utpal Roy

Abstract: OCR errors hurt retrieval performance to a great extent. Research has been done on the modelling and correction of OCR errors. However, most of the existing systems use language-dependent resources or training texts for studying the nature of errors. Not much research has been reported on improving retrieval performance from erroneous text when no training data is available. We propose an algorithm for detecting OCR errors and improving retrieval performance from the erroneous corpus. We present two versions of the algorithm: one based on word co-occurrence and the other based on Pointwise Mutual Information. Our algorithm does not use any training data or any language-specific resources such as a thesaurus. It also does not use any knowledge about the language except that the word delimiter is a blank space. We have tested our algorithm on the erroneous Bangla FIRE collection and obtained significant improvements.

Paper Nr: 96
Title:

Handling Missing Data in a Tree Species Catalog Proposed for Reforesting Mexico City

Authors:

Héctor Javier Vázquez and Mihaela Juganaru-Mathieu

Abstract: In this paper we present an application of handling missing attribute values in data about the urban forest in Mexico City. The missing attribute values concern the pollution tolerance of the trees; around 42% of our observations are incomplete. Classical methods are not applicable without introducing noise. Our proposal is to use successive steps of multiple correspondence analysis. The estimated values are validated with a clustering approach. The completed data can be used for a variety of future applications.

Paper Nr: 99
Title:

Combining Clustering and Classification Approaches for Reducing the Effort of Automatic Tweets Classification

Authors:

Elias de Oliveira, Henrique Gomes Basoni, Marcos Rodrigues Saúde and Patrick Marques Ciarelli

Abstract: The classification problem has gained a new importance dimension with the growing aggregated value which has been given to social media such as Twitter. The huge number of small documents to be organized into subjects is challenging the resources and techniques that have been used so far. Furthermore, today more than ever, personalization is the most important feature that a system needs to exhibit. The goal of many online systems, which are available in many areas, is to address the needs or desires of each individual user. To achieve this goal, these systems need to be more flexible and faster in order to adapt to the user’s needs. In this work, we explore a variety of techniques with the aim of better classifying a large Twitter data set according to a user goal. We propose a methodology where we cascade an unsupervised technique followed by a supervised one. For the unsupervised technique we use standard clustering algorithms, and for the supervised technique we propose the use of a kNN algorithm and a Centroid Based Classifier to perform the experiments. The results are promising because we reduced the amount of work to be done by the specialists and, in addition, we were able to mimic the human assessment decisions 0.7907 of the time, according to the F1-measure.

Paper Nr: 101
Title:

Stories Around You - A Two-Stage Personalized News Recommendation

Authors:

Youssef Meguebli, Mouna Kacimi, Bich-liên Doan and Fabrice Popineau

Abstract: With the tremendous growth of published news articles, a key issue is how to help users find diverse and interesting news stories. To this end, it is crucial to understand and build accurate profiles for both users and news articles. In this paper, we define a user profile based on (1) the set of entities she/he talked about in her/his comments and (2) the set of key concepts related to those entities on which the user has expressed a viewpoint. The same information is extracted from the content of each news article to create its profile. These profiles are then matched for the purpose of recommendation using a new similarity measure. We also use the news article profiles to diversify the list of recommended stories. A first evaluation involving the activities of 150 real users on four news websites, namely The Independent, The Telegraph, CNN and Aljazeera, has shown the effectiveness of our approach compared to recent works.

Paper Nr: 106
Title:

Discover Knowledge on FLOSS Projects Through RepoFinder

Authors:

Francesca Arcelli Fontana, Riccardo Roveda and Marco Zanoni

Abstract: We can retrieve and integrate knowledge of different kinds. In this paper, we focus our attention on FLOSS (Free, Libre and Open Source Software) projects. With this aim, we introduce RepoFinder, a web application we have developed for the discovery, retrieval and analysis of open source software. RepoFinder supports a keyword-based discovery process for FLOSS projects through Google-like queries. Moreover, it allows users to analyze the projects according to well-known software metrics and other features of the code, and to compare some structural aspects of the different projects. In the paper, we focus on the discovery capabilities of RepoFinder, evaluating them on different project categories and comparing them with a well-known search engine such as Google.

Paper Nr: 107
Title:

Wrapper Induction by XPath Alignment

Authors:

Joachim Nielandt, Robin de Mol, Antoon Bronselaer and Guy de Tré

Abstract: Dealing with a huge quantity of semi-structured documents and the extraction of information from them is an important topic that is getting a lot of attention. Methods that allow one to accurately define where the data can be found are pivotal in constructing a robust solution, allowing for imperfections and structural changes in the source material. In this paper we investigate a wrapper induction method that revolves around aligning XPath elements (steps), allowing a user to generalise upon the training examples given to the data extraction system. The alignment is based on a modification of the well-known Levenshtein edit distance. When the training example XPaths have been aligned with each other, they are subsequently merged into the path that generalises the examples as precisely as possible, so it can be used to accurately fetch the required data from the given source material.

Posters
Paper Nr: 18
Title:

Towards Student Success Prediction

Authors:

Hana Bydžovská and Michal Brandejs

Abstract: University information systems offer a vast amount of data which potentially contains additional hidden information and relations. Such knowledge can be used to improve teaching and facilitate the educational process. In this paper, we introduce methods based on a data mining approach and social network analysis to predict student grade performance. We focus on cases in which we can predict student success or failure with high accuracy. Machine learning algorithms can be employed with an average accuracy of 81.4%. We have defined rules based on the grade averages of students and their friends that achieved a precision of 97% and a recall of 53%. We have also used rules based on study-related data, where the best two achieved a precision of 96% and a recall of nearly 35%. The derived knowledge can be successfully utilized as a basis for a course enrollment recommender system.

Paper Nr: 21
Title:

A Noise Resilient and Non-parametric Graph-based Classifier

Authors:

Mahdi Mohammadi, Saeed Adel Mehraban, Elnaz Bigdeli, Bijan Raahemi and Ahmad Akbari

Abstract: In this paper, we propose a non-parametric and noise resilient graph-based classification algorithm. In designing the proposed method, we represent each class of dataset as a set of sub-graphs. The main part of the training phase is how to build the classification graph based on the non-parametric k-associated optimal graph algorithm which is an extension of the parametric k-associated graph algorithm. In this paper, we propose a new extension and modification of the training phase of the k-associated optimal graph algorithm. We compare the modified version of the k-associated optimal graph (MKAOG) algorithm with the original k-associated optimal graph algorithm (KAOG). The experimental results demonstrate superior performance of our proposed method in the presence of different levels of noise on various datasets from the UCI repository.

Paper Nr: 22
Title:

A Mobile Location-Aware Recommendation System

Authors:

Semih Utku and Canan Eren Atay

Abstract: Improvements in mobile technology provide greater personal information accessibility, data incorporation, and public resource accessibility, “anytime, anywhere”. Smartphones are not only devices that make phone calls, but have also become a gateway to the Internet. Mobile devices offer usage flexibility, mobility, fast wireless communication, and location-awareness. Location is determined by GPS satellite tracking, position relative to GSM base stations, and the device's media access control address. Similarly, usage of social networks is increasing steadily. The widespread usage of social networks introduces new requirements for Internet applications. Users of such networks share their ideas and interests, as well as the activities they plan to attend. In addition, they follow other users’ information and shape their planned activities accordingly. In this study, an intelligent context-aware system is described. In this field, context-awareness is a mobile paradigm in which applications can discover and take advantage of contextual information, such as user location, nearby people and devices, and user activity. This system provides a list of activities that users plan to attend. Our recommender system creates results based on data mining techniques, using personal identification data and user activities. The recommender system brings a novel methodology to the activity-decision process by utilizing the right location and real-time information.

Paper Nr: 23
Title:

‘Misclassification Error’ Greedy Heuristic to Construct Decision Trees for Inconsistent Decision Tables

Authors:

Mohammad Azad and Mikhail Moshkov

Abstract: A greedy algorithm is presented in this paper to construct decision trees for three different approaches (many-valued decision, most common decision, and generalized decision) in order to handle the inconsistency of multiple decisions in a decision table. In this algorithm, a greedy ‘misclassification error’ heuristic is used which performs faster and, for some cost functions, yields better results than the ‘number of boundary subtables’ heuristic in the literature. Therefore, it can be used on larger data sets and does not require a huge amount of memory. Experimental results on the depth, average depth and number of nodes of decision trees constructed by this algorithm are compared in the framework of each of the three approaches.

Paper Nr: 45
Title:

Learning Good Opinions from Just Two Words Is Not Bad

Authors:

Darius Andrei Suciu, Vlad Vasile Itu, Alexandru Cristian Cosma, Mihaela Dinsoreanu and Rodica Potolea

Abstract: Considering the wide spectrum of both practical and research applicability, opinion mining has attracted increased attention in recent years. This article focuses on overcoming the domain-dependency issues which occur in supervised opinion mining by using an unsupervised approach. Our work devises a methodology based on a set of grammar rules for the identification of opinion-bearing words. Moreover, we focus on tuning our method for the best trade-off between precision and recall, computational complexity and number of seed words, while not committing to a specific input data set. The method is general enough to perform well using just 2 seed words, so we can state that it is an unsupervised strategy. Moreover, since the 2 seed words are class representatives (“good”, “bad”), we claim that the method is domain independent.

Paper Nr: 58
Title:

Time Phrase Parsing for Chinese Text with HowNet Temporal Information Structure

Authors:

Hong-mei Ma, Xiao-yun Wang and Li Qin

Abstract: Time phrase parsing is useful for analyzing Chinese text, because temporal expressions are often omitted in Chinese text for coherence and cohesion. It is important to obtain the temporal information, which can then be used to generate the tense of the verb. In this paper, using the temporal information structures of HowNet, time phrases are divided into three categories, and a structure is designed to represent the temporal information of time phrases. We then put forward a method to parse time phrases in Chinese text.

Paper Nr: 59
Title:

Web Content Classification based on Topic and Sentiment Analysis of Text

Authors:

Shuhua Liu and Thomas Forss

Abstract: Automatic classification of web content has been studied extensively, using different learning methods and tools, investigating different datasets to serve different purposes. Most of the studies have made use of the content and structural features of web pages. However, previous experience has shown that certain groups of web pages, such as those that contain hatred and violence, are much harder to classify with good accuracy even when both content and structural features are taken into consideration. In this study we present a new approach for automatically classifying web pages into pre-defined topic categories. We apply text summarization and sentiment analysis techniques to extract topic and sentiment indicators of web pages. We then build classifiers based on combined topic and sentiment features. A large number of experiments were carried out. Our results suggest that incorporating the sentiment dimension can indeed bring much added value to web content classification. Classifiers based on topic similarity alone did not perform well, but when topic similarity and sentiment features are combined, classification performance is significantly improved for many web categories. Our study offers valuable insights and inputs to the development of web detection systems and Internet safety solutions.

Paper Nr: 66
Title:

Identifying Tweets that Contain a ‘Heartwarming Story’

Authors:

Manabu Okumura, Yohei Yamaguchi, Masatomo Suzuki and Hiroko Otsuka

Abstract: In this paper we present a rather new task: detecting and collecting tweets that contain heartwarming stories from the huge number of tweets on Twitter. We also present a method for identifying heartwarming tweets. Our prediction method is based on a supervised learning algorithm, an SVM, along with features extracted from the tweets. By comparing feature sets, we found that adding sentiment features mostly improves performance. However, simply adding the features for detecting a story in a tweet (past tense and tweet length) does not improve performance, while adding all the features to the baseline feature set mostly yields the best performance among the feature sets.
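The abstract names three concrete feature families: sentiment, past tense, and tweet length. As a rough illustration only (the original work targets Japanese tweets; the English lexicon, the "-ed" past-tense heuristic, and the feature names below are invented stand-ins, not the paper's resources), such features might be extracted like this before being fed to an SVM:

```python
# Hypothetical positive-sentiment lexicon (stand-in, not the paper's).
POSITIVE_WORDS = {"kind", "helped", "smile", "thank", "warm"}

def tweet_features(tweet):
    """Toy feature extractor mirroring the abstract's feature families."""
    tokens = tweet.lower().split()
    return {
        # sentiment feature: count of positive-lexicon hits
        "pos_count": sum(t.strip(".,!") in POSITIVE_WORDS for t in tokens),
        # "story" features named in the abstract: past tense and tweet length
        "past_tense": any(t.strip(".,!").endswith("ed") for t in tokens),
        "length": len(tokens),
    }

feats = tweet_features("A stranger helped the old man and thanked him warmly.")
```

A real system would vectorize such dictionaries and train the SVM on labeled tweets; the sketch only shows the shape of the inputs.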

Paper Nr: 81
Title:

Association Rules Between Terms Appreciated by EII for Query Expansion in IR

Authors:

Sourour Belhadj Rhouma and Chiraz Latiri

Abstract: In this paper we study a new approach for mining association rules between terms in the context of Information Retrieval (IR), applied to automatic Query Expansion (QE). These are non-redundant association rules resulting from a mining process over a large bilingual corpus and selected according to quality measures. We show the interest and usefulness of association rules selected by a quality measure other than confidence, namely the Entropic Implication Intensity (EII) (Gras and Couturier, 2012). Our experiments were conducted on two collections (English-French) of CLEF 2003. The results show a larger improvement for QE with association rules selected by EII than with those selected by confidence.

Paper Nr: 83
Title:

Finding the Frequent Pattern in a Database - A Study on the Apriori Algorithm

Authors:

Najlaa AlHuwaishel, Maram AlAlwan and Ghada Badr

Abstract: This paper is a study of frequent pattern mining using the Apriori algorithm. We present the concept of data mining at a high level and explain how frequent patterns are mined. We discuss the Apriori algorithm and its efficiency problem, and then explore several algorithms proposed as improvements on Apriori that address this problem. We also compare the selected algorithms with respect to: 1. the number of database scans; 2. the number of generated candidate lists; 3. memory usage.
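The level-wise Apriori procedure that the paper studies can be sketched as follows; this is a generic textbook sketch (one database scan per pass, candidate generation by join-and-prune), not code from the paper. The repeated scans and candidate lists are exactly the costs the compared improvements try to reduce:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori: each pass scans the database once, counts candidate
    itemsets, and generates the next level's candidates from the survivors."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items}  # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # one database scan per pass: count support of every candidate
        counts = {c: sum(c <= t for t in transactions) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # join step (unions of size k+1) plus prune step: keep a candidate
        # only if all of its k-subsets are frequent
        k += 1
        current = {
            a | b
            for a in survivors for b in survivors
            if len(a | b) == k
            and all(frozenset(s) in survivors for s in combinations(a | b, k - 1))
        }
    return frequent
```

For example, over four transactions {a,b,c}, {a,b}, {a,c}, {b,c} with minimum support 2, the algorithm keeps all singletons and all pairs but prunes {a,b,c}, which occurs only once.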

Paper Nr: 85
Title:

Emergent Induction of L-system Grammar from a String with Deletion-type Transmutation

Authors:

Ryohei Nakano

Abstract: An L-system is a computational model that captures the growth process of plants. A noise-tolerant grammatical induction method called LGIC2 was previously proposed for deterministic context-free L-systems. LGIC2 induces L-system grammars from a transmuted string mY using an emergent approach: frequently appearing substrings are extracted from mY to form grammar candidates. Each grammar candidate can be used to generate a string Z; however, the number of grammar candidates becomes huge. LGIC2 therefore introduced three pruning techniques to narrow the candidates down to promising ones. The candidates whose generated strings Z are most similar to mY are selected as the final solutions. So far, LGIC2 has been evaluated for replacement- and insertion-type transmutations. This paper evaluates the performance of LGIC2 for deletion-type transmutation, after slightly modifying the method.
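For context, the deterministic context-free (D0L) rewriting underlying these grammars is simple to state: at every step, each symbol of the string is replaced in parallel by its production. A minimal sketch, using Lindenmayer's classic algae system as the example rather than any grammar from the paper:

```python
def derive(axiom, rules, steps):
    """D0L rewriting: apply all productions in parallel at each step.
    Symbols without a production are copied unchanged."""
    s = axiom
    for _ in range(steps):
        s = "".join(rules.get(c, c) for c in s)
    return s

# Lindenmayer's algae system: A -> AB, B -> A
algae = {"A": "AB", "B": "A"}
# A -> AB -> ABA -> ABAAB
result = derive("A", algae, 3)
```

Grammar induction, as in LGIC2, is the inverse and much harder problem: recovering `rules` from an observed (and here, transmuted) derived string.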

Paper Nr: 87
Title:

Subjectivity and Objectivity in Urban Knowledge Representation

Authors:

Antonia Cataldo, Valerio Cutini, Valerio Di Pinto and Antonio M. Rinaldi

Abstract: The question of subjectivity and objectivity of information is an important open issue in the knowledge engineering research community. In the context of space representation, they have been traditionally considered competing themes in the study of places, particularly in urban ones. This is highlighted by the distance, in terms of cultural training and operational approach, between the professionals of the city: urban planners and urban anthropologists. The growth in modeling capabilities allows a quantitative study of a city but information about the meanings of space elements are often not taken into account. Starting from this basic assumption, our paper aim is to give a novel point of view to integrate subjectivity and objectivity in an operational model. Space Syntax, as a theory and a methodology, is used as a tool to study the objectivity of the urban space. Ontologies, as an approach and a method to formally represent knowledge, is used to provide Space Syntax with the subjectivity of the same spaces.

Paper Nr: 92
Title:

A Method for Evaluating Validity of Piecewise-linear Models

Authors:

Oleg V. Senko, Dmitry S. Dzyba, Ekaterina A. Pigarova, Liudmila Ya. Rozhinskaya and Anna V. Kuznetsova

Abstract: A method for evaluating the optimal complexity of regression models is discussed. It is assumed that a complicated model should be used only when a simple model fails to exhaustively describe the regularity that exists in the data. The null hypothesis that the data are exhaustively explained by the simple regularity is tested with the help of the complicated model. The validity of the null hypothesis is evaluated via a p-value computed using a special version of the permutation test. An application is discussed in which the developed technique is used to evaluate whether more complicated piecewise-linear regressions should be used instead of simple regressions to correctly describe the dependence of parathyroid hormone on vitamin D status.
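The abstract does not specify the test statistic or the permutation scheme, so the following is only a plausible sketch of the general idea: fit the simple linear model, take the reduction in squared error achieved by a piecewise-linear model (here, two independent segments split at a fixed, assumed breakpoint) as the statistic, and build its null distribution by refitting on data whose residuals have been permuted under the simple model:

```python
import random

def fit_linear(x, y):
    """Ordinary least squares for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

def sse(x, y, a, b):
    """Sum of squared errors of the line (a, b)."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

def piecewise_sse(x, y, knot):
    """SSE of two independent linear pieces split at a fixed knot."""
    total = 0.0
    for part in ([(p, q) for p, q in zip(x, y) if p <= knot],
                 [(p, q) for p, q in zip(x, y) if p > knot]):
        px, py = [p for p, _ in part], [q for _, q in part]
        a, b = fit_linear(px, py)
        total += sse(px, py, a, b)
    return total

def permutation_p_value(x, y, knot, n_perm=200, seed=0):
    """H0: the simple linear model explains the data exhaustively.
    The statistic is the SSE reduction gained by the piecewise model;
    its null distribution comes from permuting the linear residuals."""
    a, b = fit_linear(x, y)
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    observed = sse(x, y, a, b) - piecewise_sse(x, y, knot)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        perm = resid[:]
        rng.shuffle(perm)
        y0 = [a + b * xi + r for xi, r in zip(x, perm)]
        a0, b0 = fit_linear(x, y0)
        if sse(x, y0, a0, b0) - piecewise_sse(x, y0, knot) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one p-value estimate
```

On data with a genuine kink (e.g. y rising linearly and then flattening), the observed improvement is far outside the permuted null distribution and the p-value is small, favoring the piecewise model.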

Paper Nr: 93
Title:

Adaptive Semantic Construction for Diversity-based Image Retrieval

Authors:

Ghada Feki, Anis Ben Ammar and Chokri Ben Amar

Abstract: In recent years, the explosive growth of multimedia databases and digital libraries has revealed crucial problems in indexing and retrieving images, which led us to develop our own approach. Our proposed approach, TAD, consists in disambiguating web queries to build an adaptive semantics for diversity-based image retrieval. The TAD approach is a puzzle constituted by three main components: the TAWQU (Thesaurus-Based Ambiguous Web Query Understanding) process, the ASC (Adaptive Semantic Construction) process, and the DR (Diversity-based Retrieval) process. Wikipedia pages represent our main source of information, and the NUS-WIDE dataset is the bedrock of our adaptive semantics, which permits us to perform a sound evaluation. The experiments demonstrate promising results for the majority of the twelve ambiguous queries.

Paper Nr: 102
Title:

Towards Integrity in Diversity-aware Small Set Selection and Visualisation Tasks

Authors:

Marcin Sydow

Abstract: In this short paper we introduce a novel notion of integrity in diversity-aware selection and visualisation tasks, present motivation for studying this notion, and illustrate it on a case study concerning the visualisation of semantic entity summaries. In particular, we propose a novel visual integrity measure for this case study and illustrate it in a preliminary experiment.