SSTM 2013 Abstracts

Full Papers
Paper Nr: 5

Mining the Long Tail of Search Queries - Finding Profitable Patterns


Michael Meisel, Maik Benndorf and Andreas Ittner

Abstract: Many search engine marketing campaigns contain a large number of distinct low-frequency search queries, referred to as the "Long Tail". Because the sample size of any single low-frequency query is limited, it is not possible to draw reliable conclusions about its performance with respect to a business goal. In this paper we present a method for finding profitable patterns in the long tail of search queries. The method aggregates search queries based on mined patterns and rejects the non-profitable groups. We applied our method to a search engine marketing campaign with over 10,000 different search queries and performed an offline test and an online A/B test to measure the performance of the method.
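The aggregate-then-reject idea in this abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not say how patterns are mined, so single query tokens stand in for mined patterns here. The data rows, the function names and the `min_clicks` threshold are all assumptions made for illustration.

```python
from collections import defaultdict

# Toy rows: (query, clicks, cost, revenue).
# All values are invented for illustration; they are not from the paper.
QUERIES = [
    ("cheap red shoes", 3, 1.50, 0.00),
    ("red shoes sale", 2, 1.00, 8.00),
    ("buy red boots", 4, 2.00, 12.00),
    ("cheap blue hat", 5, 2.50, 0.00),
    ("cheap boots online", 2, 1.00, 0.00),
]


def aggregate_by_token(rows):
    """Pool per-query statistics under every token the query contains.

    Assumption: a single token acts as the "pattern"; the paper's actual
    pattern mining step is not described in the abstract.
    """
    groups = defaultdict(lambda: {"clicks": 0, "cost": 0.0, "revenue": 0.0})
    for query, clicks, cost, revenue in rows:
        for token in set(query.split()):
            group = groups[token]
            group["clicks"] += clicks
            group["cost"] += cost
            group["revenue"] += revenue
    return dict(groups)


def profitable_patterns(groups, min_clicks=5):
    """Keep patterns with at least min_clicks pooled clicks and positive profit."""
    kept = {}
    for token, group in groups.items():
        profit = group["revenue"] - group["cost"]
        if group["clicks"] >= min_clicks and profit > 0:
            kept[token] = round(profit, 2)
    return kept


groups = aggregate_by_token(QUERIES)
kept = profitable_patterns(groups)
```

In this toy data no single query has more than five clicks, but pooling by token gives groups such as "red" enough volume to be judged, while the "cheap" group is rejected because its pooled profit is negative.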

Short Papers
Paper Nr: 2

Arabase - A Database Combining Different Arabic Resources with Lexical and Semantic Information


Hazem Raafat, Mohamed Zahran and Mohsen Rashwan

Abstract: Language resources are an important factor in any NLP application. However, language resource support for Arabic is poor because the existing Arabic language resources are scattered, inconsistent or even incomplete. In this paper we discuss the notion of an integrated Arabic resource that leverages various pre-existing ones. We present a comparison between these resources, and then present preliminary fully automated and semi-automated methods to integrate them. This work serves as a bootstrap for a rich Arabic-Arabic resource with good potential to interface with WordNet.

Paper Nr: 3

Challenges and Potentials for Keyword Extraction from Company Websites for the Development of Regional Knowledge Maps


Christian Wartena and Montserrat Garcia Alsina

Abstract: Regional Innovation Systems describe the relations between actors, structures and infrastructures in a region in order to stimulate innovation and regional development. For these systems the collection and organization of information is crucial. In the present paper we investigate the possibilities for extracting information from company websites. First we describe regional innovation systems and the information types that are necessary to create them. Then we discuss the potential of text mining and keyword extraction techniques for extracting this information from company websites. Finally, we describe a small-scale experiment in which keywords related to economic sectors and commodities are extracted from the websites of over 200 companies. This experiment shows what the main challenges are for information extraction from websites for regional innovation systems.

Paper Nr: 4

How to Extract Unit of Measure in Scientific Documents?


Soumia Lilia Berrahou, Patrice Buche, Juliette Dibie-Barthelemy and Mathieu Roche

Abstract: A large amount of quantitative data related to experimental results is reported in scientific documents as free text. Each quantitative result is characterized by a numerical value, often followed by a unit of measure. Automatically extracting quantitative data is a painstaking process because units are written in many different ways within documents. In our paper, we propose to focus on the extraction and identification of variant units, in order to iteratively enrich the terminological part of an Ontological and Terminological Resource (OTR) and ultimately to allow the extraction of quantitative data. Unit extraction involves two main steps. Since we work on unstructured documents, units are completely drowned in textual information. In the first step, our method handles the crucial and time-consuming process of locating units, using supervised learning methods. Once the units have been located in the text, the second step consists of extracting and identifying candidate units in order to enrich the OTR. The extracted candidates are compared to units already known in the OTR using a new string distance measure to validate whether or not they are relevant variants. We conducted conclusive experiments on our two-step method on a set of more than 35,000 sentences.
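The second step, comparing candidate units with known ones, can be illustrated with a minimal sketch. The abstract does not define the authors' new string distance, so plain Levenshtein edit distance is swapped in here as a stand-in. The unit list, the function names and the `max_ratio` threshold are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical units assumed to be already stored in the OTR.
KNOWN_UNITS = ["cm3.m-2.day-1", "kg.m-3", "mol.l-1"]


def closest_known_unit(candidate, known=KNOWN_UNITS, max_ratio=0.4):
    """Return the nearest known unit, or None if the candidate is too far.

    max_ratio is an assumed cut-off on distance divided by the longer length.
    """
    best = min(known, key=lambda unit: levenshtein(candidate, unit))
    ratio = levenshtein(candidate, best) / max(len(candidate), len(best))
    return best if ratio <= max_ratio else None
```

With these assumptions, a candidate such as "kg/m3" is matched to the known unit "kg.m-3", while an unrelated string such as "hello" returns None.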