SSTM 2014 Abstracts


Full Papers
Paper Nr: 1
Title:

Applying Information-theoretic and Edit Distance Approaches to Flexibly Measure Lexical Similarity

Authors:

Thi Thuy Anh Nguyen and Stefan Conrad

Abstract: Measurement of similarity plays an important role in data mining and information retrieval. Several techniques for calculating the similarity between objects have been proposed, for example lexical-based, structure-based and instance-based measures. Existing lexical similarity measures are usually based on either n-gram or Dice's approaches to obtain correspondences between strings. Although these measures are efficient, they are inadequate when strings are very similar, or when two strings contain the same set of characters in different positions. In this paper, a lexical similarity approach that combines an information-theoretic model with edit distance to determine correspondences among concept labels is developed. Precision, Recall and F-measure, as well as partial OAEI 2008 benchmark tests, are used to evaluate the proposed method. The results show that our approach is flexible and has some prominent features compared to other lexical-based methods.
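The edit-distance component of the measure described above can be illustrated with a minimal sketch. This is standard Levenshtein distance with a simple normalisation to [0, 1]; the paper's additional information-theoretic weighting is not reproduced, and the function names are illustrative only:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (two-row variant)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def edit_similarity(a: str, b: str) -> float:
    """Edit distance normalised to a [0, 1] similarity score."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))
```

Note that plain Levenshtein charges two operations for a transposition, so "abcd" and "abdc" score only 0.5 despite sharing all characters, which is exactly the weakness the abstract points out.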

Paper Nr: 5
Title:

Company Mention Detection for Large Scale Text Mining

Authors:

Rebecca J. Passonneau, Tifara Ramelson and Boyi Xie

Abstract: Text mining on a large scale that addresses actionable prediction needs to contend with noisy information in documents, and with interdependencies between the NLP techniques applied and the data representation. This paper presents an initial investigation of the impact of improved company mention detection for financial analytics using Named Entity recognition and coreference. Coverage of company mention detection improves dramatically. Improvement for prediction of stock price varies, depending on the data representation.

Short Papers
Paper Nr: 2
Title:

Mining Tweet Data - Statistic and Semantic Information for Political Tweet Classification

Authors:

Guillaume Tisserant, Mathieu Roche and Violaine Prince

Abstract: This paper deals with the quality of the textual features used to classify tweets. The aim of our study is to show how improving the representation of textual data affects the performance of learning algorithms. We first introduce our method YYYYY, which generalises the words that are less relevant for tweet classification. Secondly, we compare and discuss the types of textual features produced by different approaches. More precisely, we discuss the semantic specificity of textual features, e.g. named entities and hashtags.

Paper Nr: 4
Title:

Automatic Audiovisual Documents Genre Description

Authors:

Manel Fourati, Anis Jedidi and Faiez Gargouri

Abstract: Audiovisual documents are among the most rapidly proliferating resources. Faced with the huge quantities produced every day, there is a lack of meaningful descriptions that capture the important content. Extracting such descriptions requires an analysis of the audiovisual document's content. Automating the process of describing audiovisual documents is essential because of the richness and diversity of the available analytical criteria. In this paper, we present a method that automatically extracts a semantic description, such as genre, from the content. We chose to describe cinematic audiovisual documents based on documentation prepared in the pre-production phase of films, namely the synopsis. Experimental results on IMDb (Internet Movie Database) and the Wikipedia encyclopedia indicate that our genre detection method improves on the genre information found in these corpora.

Posters
Paper Nr: 3
Title:

Combining N-gram based Similarity Analysis with Sentiment Analysis in Web Content Classification

Authors:

Shuhua Liu and Thomas Forss

Abstract: This research concerns the development of web content detection systems that will be able to automatically classify any web page into pre-defined content categories. Our work is motivated by practical experience and observations that certain categories of web pages, such as those that contain hatred and violence, are much harder to classify with good accuracy even when both content and structural features are taken into account. To further improve the performance of detection systems, we bring web sentiment features into the classification models. In addition, we incorporate n-gram representation into our classification approach, based on the assumption that n-grams can capture more local context information in text and thus could help to enhance topic similarity analysis. Unlike most studies, which only consider the presence or frequency counts of n-grams, we make use of tf-idf weighted n-grams in building the content classification models. Our results show that unigram-based models, though much simpler, have unique value and effectiveness in web content classification. Higher-order n-gram based approaches, especially 5-gram based models that combine topic similarity features with sentiment features, bring significant improvement in precision levels for the Violence and two Racism-related web categories.
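The tf-idf weighting of n-grams described above can be sketched as follows. The exact tf and idf variants used by the authors are not stated, so the relative-frequency tf and raw-log idf below, as well as the function names, are assumptions:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf_ngram_vectors(docs, n=2):
    """Represent each tokenised document as a {n-gram: tf-idf weight} map."""
    counts = [Counter(ngrams(doc, n)) for doc in docs]
    df = Counter()                      # document frequency of each n-gram
    for c in counts:
        df.update(c.keys())
    total_docs = len(docs)
    vectors = []
    for c in counts:
        length = sum(c.values())
        vectors.append({g: (freq / length) * math.log(total_docs / df[g])
                        for g, freq in c.items()})
    return vectors
```

An n-gram appearing in every document gets idf = log(1) = 0 and so carries no weight, which is the intended contrast with presence/frequency-only representations.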

Paper Nr: 6
Title:

Text Mining Technologies for Database Curation

Authors:

Fabio Rinaldi

Abstract: Text mining technologies, coupled with advanced user interfaces, have great potential in the life sciences, for example in supporting the process of database curation. We present a system which has achieved competitive results in several community-organized evaluations of text mining technologies, and we discuss how such technologies can be integrated into a curation workflow.

Paper Nr: 7
Title:

Context-based Disambiguation using Wikipedia

Authors:

Hugo Batista, David Carrega, Rui Rodrigues and Joaquim Filipe

Abstract: This paper addresses the problem of semantic ambiguity, identified in a previous work where we presented an algorithm for quantifying the semantic relatedness of entities characterized by a set of potentially ambiguous features. We propose to solve the feature-ambiguity problem by determining the context defined by the non-ambiguous features, and then using this context to select the most adequate interpretation of the ambiguous features. As a result, the entity semantic relatedness process is improved by reducing the probability of using erroneous features due to ambiguous meanings.
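The selection step described above can be sketched minimally, assuming each candidate sense and the context are represented as plain sets of terms. Both the representation and the overlap score are illustrative assumptions, not the authors' exact algorithm:

```python
def disambiguate(senses, context):
    """Pick the candidate sense whose term set overlaps most with the
    context built from the non-ambiguous features.
    senses: dict mapping sense name -> set of related terms
    context: set of terms from the unambiguous features"""
    return max(senses, key=lambda s: len(senses[s] & context))

# Hypothetical example: "jaguar" disambiguated by the surrounding features.
senses = {"jaguar_car": {"vehicle", "engine", "brand"},
          "jaguar_cat": {"animal", "feline", "jungle"}}
context = {"engine", "brand", "speed"}
best = disambiguate(senses, context)  # -> "jaguar_car"
```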

Paper Nr: 8
Title:

Semantic Relatedness with Variable Ontology Density

Authors:

Rui Rodrigues, Joaquim Filipe and Ana L. N. Fred

Abstract: In a previous work, we proposed a semantic relatedness measure between scientific concepts, using the Wikipedia category network as an ontology, based on the length of the category path. After observing substantial differences in the arc density of the category network across the whole graph, we concluded that these irregularities in ontology density may lead to substantial errors in the computation of the semantic relatedness measure. We now attempt to correct this bias and improve the measure by introducing the notion of ontology density and proposing a new semantic relatedness measure. The proposed measure computes a weighted length of the category path between two concepts in the ontology graph, assigning a different weight to each arc of the path depending on the ontology density in its region. This procedure has been extended to measure semantic relatedness between entities, an entity being defined as a set of concepts.
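The density-weighted path length can be sketched with Dijkstra's algorithm over the category graph. The specific weighting rule below (arcs through high-degree, i.e. denser, regions count as shorter) is an illustrative assumption; the paper's actual density estimate is not reproduced:

```python
import heapq

def weighted_path_length(graph, src, dst):
    """Shortest density-weighted path between two categories.
    graph maps each node to the set of its neighbours; an arc whose
    endpoints have high degree (a dense region) gets a smaller weight."""
    def arc_weight(u, v):
        return 2.0 / (len(graph[u]) + len(graph[v]))

    dist = {src: 0.0}
    queue = [(0.0, src)]
    while queue:
        d, u = heapq.heappop(queue)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue                      # stale queue entry
        for v in graph[u]:
            nd = d + arc_weight(u, v)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(queue, (nd, v))
    return float("inf")
```

With uniform arc weights this reduces to the plain category-path length of the earlier measure; the degree-dependent weight is what corrects for uneven ontology density.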