SSTM 2011 Abstracts

Full Papers
Paper Nr: 2

MAPPING KNOWLEDGE DOMAINS - Combining Symbolic Relations with Graph Theory


Eric SanJuan

Abstract: We present a symbolic and graph-based approach for mapping knowledge domains. The symbolic component relies on shallow linguistic processing of texts to extract multi-word terms and cluster them based on lexico-syntactic relations. The clusters are subjected to graph decomposition basing on inherent graph theoretic properties of association graphs of items (authors-terms, documents-authors, etc). These include the search for complete minimal separators that can decompose the graphs into central (core topics) and peripheral atoms. The methodology is implemented in the TermWatch system and can be used for several text mining tasks. We also mined for frequent itemsets as a means of revealing dependencies between formal concepts in the corpus. A comparison of the frequent itemsets extracted on each dataset and the structure of the central atom shows an interesting overlap. The interesting features of our approach lie in the combination of state-of-the-art techniques from Natural Language Processing (NLP), Clustering and Graph Theory to develop a system and methodology adapted to uncovering hidden sub-structures from texts.

Paper Nr: 4



Shashank Paliwal and Vikram Pudi

Abstract: Measuring inter-document similarity is one of the most essential steps in text document clustering. Traditional methods rely on representing text documents using the simple Bag-of-Words (BOW) model which assumes that terms of a text document are independent of each other. Such single term analysis of the text completely ignores the underlying (semantic) structure of a document. In the literature, sufficient efforts have been made to enrich BOW representation using phrases and n-grams like bi-grams and tri-grams. These approaches take into account dependency only between adjacent terms or a continuous sequence of terms. However, while some of the dependencies exist between adjacent words, others are more distant. In this paper, we make an effort to enrich traditional document vector by adding the notion of term-pair features. A Term-Pair feature is a pair of two terms of the same document such that they may be adjacent to each other or distant. We investigate the process of term-pair selection and propose a methodology to select potential term-pairs from the given document. Utilizing term proximity between distant terms also allows some flexibility for two documents to be similar if they are about similar topics but with varied writing styles. Experimental results on standard web document data set show that the clustering performance is substantially improved by adding term-pair features.

Paper Nr: 5



Fabio Clarizia, Francesco Colace, Massimo De Santo, Luca Greco and Paolo Napoletano

Abstract: Text classification methods have been evaluated on supervised classification tasks of large datasets showing high accuracy. Nevertheless, due to the fact that these classifiers, to obtain a good performance on a test set, need to learn from many examples, some difficulties may be found when they are employed in real contexts. In fact, most users of a practical system do not want to carry out labeling tasks for a long time only to obtain a better level of accuracy. They obviously prefer algorithms that have high accuracy, but do not require a large amount of manual labeling tasks. In this paper we propose a new supervised method for single-label text classification, based on a mixed Graph of Terms, that is capable of achieving a good performance, in term of accuracy, when the size of the training set is 1% of the original. The mixed Graph of Terms can be automatically extracted from a set of documents following a kind of term clustering technique weighted by the probabilistic topic model. The method has been tested on the top 10 classes of the ModApte split from the Reuters-21578 dataset and learnt on 1% of the original training set. Results have confirmed the discriminative property of the graph and have confirmed that the proposed method is comparable with existing methods learnt on the whole training set.

Short Papers
Paper Nr: 3



André Lourenço, Liliana Medina, Ana Fred and Joaquim Filipe

Abstract: Unsupervised organisation of documents, and in particular research papers, into meaningful groups is a difficult problem. Using the typical vector-space-model representation (Bag-of-words paradigm), difficulties arise due to its intrinsic high dimensionality, high redundancy of features, and the lack of semantic information. In this work we propose a document representation relying on a statistical feature reduction step, and an enrichment phase based on the introduction of higher abstraction terms, designated as metaterms, derived from text, using as prior knowledge papers topics and keywords. The proposed representation, combined with a clustering ensemble approach, leads to a novel document organization strategy. We evaluate the proposed approach taking as application domain conference papers, topic information being extracted from conference topics or areas. Performance evaluation on data sets from NIPS and INSTICC conferences show that the proposed approach leads to interesting and encouraging results.