SSTM 2012 Abstracts

Full Papers
Paper Nr: 3

Improving Toponym Disambiguation by Iteratively Enhancing Certainty of Extraction


Mena B. Habib and Maurice van Keulen

Abstract: Named entity extraction (NEE) and disambiguation (NED) have received much attention in recent years. Typical fields addressing these topics are information retrieval, natural language processing, and semantic web. This paper addresses two problems with toponym extraction and disambiguation (as a representative example of named entities). First, almost no existing works examine the extraction and disambiguation interdependency. Second, existing disambiguation techniques mostly take as input extracted named entities without considering the uncertainty and imperfection of the extraction process. It is the aim of this paper to investigate both avenues and to show that explicit handling of the uncertainty of annotation has much potential for making both extraction and disambiguation more robust. We conducted experiments with a set of holiday home descriptions with the aim to extract and disambiguate toponyms. We show that the extraction confidence probabilities are useful in enhancing the effectiveness of disambiguation. Reciprocally, retraining the extraction models with information automatically derived from the disambiguation results, improves the extraction models. This mutual reinforcement is shown to even have an effect after several automatic iterations.

Paper Nr: 6

Informativeness-based Keyword Extraction from Short Documents


Mika Timonen, Timo Toivanen, Yue Teng, Chao Chen and Liang He

Abstract: With the rise of user created content on the Internet, the focus of text mining has shifted. Twitter messages and product descriptions are examples of new corpora available for text mining. Keyword extraction, user modeling and text categorization are all areas that are focusing on utilizing this new data. However, as the documents within these corpora are considerably shorter than in the traditional cases, such as news articles, there are also new challenges. In this paper, we focus on keyword extraction from documents such as event and product descriptions, and movie plot lines that often hold 30 to 60 words. We propose a novel unsupervised keyword extraction approach called Informativeness-based Keyword Extraction (IKE) that uses clustering and three levels of word evaluation to address the challenges of short documents. We evaluate the performance of our approach by using manually tagged test sets and compare the results against other keyword extraction methods, such as CollabRank, KeyGraph, Chi-squared, and TF-IDF. We also evaluate the precision and effectiveness of the extracted keywords for user modeling and recommendation and report the results of all approaches. In all of the experiments IKE out-performs the competition.

Short Papers
Paper Nr: 1

Contextual Latent Semantic Networks used for Document Classification


Ondrej Hava, Miroslav Skrbek and Pavel Kordik

Abstract: Widely used document classifiers are developed over a bag-of-words representation of documents. Latent semantic analysis based on singular value decomposition is often employed to reduce the dimensionality of such representation. This approach overlooks word order in a text that can improve the quality of classifier. We propose language independent method that records the context of particular word into a context network utilizing products of latent semantic analysis. Words' contexts networks are combined to one network that represents a document. A new document is classified based on a similarity between its network and training documents networks. The experiments show that proposed classifier achieves better performance than common classifiers especially when a foregoing reduction of dimensionality is significant.

Paper Nr: 5

Measuring Entity Semantic Relatedness using Wikipedia


Liliana Medina, Ana L. N. Fred, Rui Rodrigues and Joaquim Filipe

Abstract: In this paper we propose a semantic relatedness measure between scientific concepts, using Wikipedia as an hierarchical taxonomy. The devised measure examines the length of Wikipedia category path between two concepts, assigning a weight to each category that corresponds to its depth in the hierarchy. This procedure was extended to measure the relatedness between two distinct concept sets (herein referred to as entities), where the amount of shared nodes in the paths computed for all possible concept sets is also integrated in a global relatedness measure index.