KDIR 2018 Abstracts


Full Papers
Paper Nr: 1
Title:

Data Clustering using Homomorphic Encryption and Secure Chain Distance Matrices

Authors:

Nawal Almutairi, Frans Coenen and Keith Dures

Abstract: Secure data mining has emerged as an essential requirement for exchanging confidential data in terms of third party (outsourced) data analytics. An emerging form of encryption, Homomorphic Encryption, allows a limited amount of data manipulation and, when coupled with additional information, can facilitate secure third party data analytics. However, the resources required are substantial, which leads to scalability issues. Moreover, in many cases, data owner participation can still be significant, thus not providing a full realisation of the vision of third party data analytics. The focus of this paper is therefore scalable and secure third party data clustering with only very limited data owner participation. To this end, the concept of Secure Chain Distance Matrices is proposed. The mechanism is fully described and analysed in the context of three different clustering algorithms. Excellent evaluation results were obtained.

Paper Nr: 6
Title:

Graph Convolutional Matrix Completion for Bipartite Edge Prediction

Authors:

Yuexin Wu, Hanxiao Liu and Yiming Yang

Abstract: Leveraging intrinsic graph structures in data to improve bipartite edge prediction has become an increasingly important topic in recent machine learning research. Existing methods, however, face open challenges in how to enrich model expressiveness and reduce computational complexity for scalability. This paper addresses both challenges with a novel approach that uses a multi-layer/hop neural network to model a hidden space, and the first-order Chebyshev approximation to reduce training time complexity. Our experiments on benchmark datasets for collaborative filtering, citation network analysis, course prerequisite prediction and drug-target interaction prediction show the advantageous performance of the proposed approach over several state-of-the-art methods.

Paper Nr: 9
Title:

Ranking Quality of Answers Drawn from Independent Evident Passages in Neural Machine Comprehension

Authors:

Avi Bleiweiss

Abstract: Machine comprehension has gained increased interest with the recent release of real-world and large-scale datasets. In this work, we developed a neural model built of multiple coattention encoders to address datasets that draw answers to a query from orthogonal context passages. The novelty of our model is in producing passage ranking based entirely on the answer quality obtained from coattention processing. We show that using instead the search-engine presentation order of indexed web pages, from which evidence articles have been extracted, may affect performance adversely. To evaluate our model, we chose the MSMARCO dataset that allows queries to have anywhere from no answer to multiple answers assembled of words both in and out of context. We report extensive quantitative results to show performance impact of various dataset components.

Paper Nr: 10
Title:

HEXTRATO: Using Ontology-based Constraints to Improve Accuracy on Learning Domain-specific Entity and Relationship Embedding Representation for Knowledge Resolution

Authors:

Hegler Tissot

Abstract: This paper focuses on the problem of learning low-dimensional embedding representations for entities and relations extracted from domain-specific datasets. Existing embedding methods aim to represent entities and relations from a knowledge graph as vectors in a continuous low-dimensional space. Different approaches have been proposed, usually evaluated on standard benchmark knowledge graphs such as WordNet and Freebase. However, the nature of such data sources prevents those methods from taking advantage of more detailed and enriched metadata, which limits the accuracy of results on the evaluation tasks. In this paper, we propose HEXTRATO, a novel embedding approach that extends a traditional baseline model, TransE, by adding ontology-based constraints in order to better capture the relationships between categorised entities and their symbolic representation in the vector space. Our method is evaluated on an adapted version of Freebase, on a publicly available dataset used in machine learning benchmarks, and on two datasets in the clinical domain. Our method outperforms the state-of-the-art accuracy on the link prediction task, showing that the learnt entity and relation embedding representation can be used to improve more complex embedding models.
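For readers unfamiliar with the TransE baseline named in this abstract, the following is a minimal sketch of its scoring idea: a triple (head, relation, tail) is considered plausible when the head vector plus the relation vector lands close to the tail vector. The function name, the toy entities and the 2-dimensional vectors are illustrative assumptions, and the ontology-based constraints that HEXTRATO adds are not described in the abstract and are not shown here.

```python
import math


def transe_score(head, relation, tail):
    """Return the L2 distance ||head + relation - tail||.

    Under the TransE assumption, a lower distance means the triple
    (head, relation, tail) is more plausible.
    """
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Toy 2-dimensional embeddings; the names and values are hypothetical.
paris = [1.0, 0.0]
capital_of = [0.0, 1.0]
france = [1.0, 1.0]
germany = [3.0, 1.0]

# The true triple should score lower (better) than the corrupted one.
good = transe_score(paris, capital_of, france)
bad = transe_score(paris, capital_of, germany)
print(good < bad)
```

In link prediction, candidate tails would be ranked by this distance; the toy vectors above are chosen so that the correct tail is an exact translation of the head.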

Paper Nr: 17
Title:

Beyond k-NN: Combining Cluster Analysis and Classification for Recommender Systems

Authors:

Rabaa Alabdulrahman, Herna Viktor and Eric Paquet

Abstract: Recommendation systems have a wide application in e-business and have been successful in guiding users in their online purchases. The use of data mining techniques, to aid recommendation systems in their goal to learn the correct user profiles, is an active area of research. In most recent works, recommendations are obtained by applying a supervised learning method, notably the k-nearest neighbour (k-NN) algorithm. However, classification algorithms require a class label, and in many applications, such labels are not available, leading to extensive domain expert labelling. In addition, recommendation systems suffer from a data sparsity problem, i.e. the number of items purchased by a customer is typically a small subset of all available products. One solution to overcome the labelling and data sparsity problems is to apply cluster analysis techniques prior to classification. Cluster analysis allows one to learn the natural groupings, i.e. similar customer profiles. In this paper, we study the value of applying cluster analysis techniques to customer ratings prior to applying classification models. Our HCC-Learn framework combines content-based analysis in the cluster analysis stage, with collaborative filtering in the recommending stage. Our experimental results show the value of combining cluster analysis and classification against two real-world data sets.

Paper Nr: 21
Title:

Contributions to the Detection of Unreliable Twitter Accounts through Analysis of Content and Behaviour

Authors:

Nuno Guimarães, Álvaro Figueira and Luís Torgo

Abstract: Misinformation propagation on social media has been growing significantly, reaching major exposure in the 2016 United States Presidential Election. Since then, the scientific community and major tech companies have been working on the problem to avoid the propagation of misinformation. Research has focused on three major sub-fields: the identification of fake news through the analysis of unreliable posts, the propagation patterns of posts in social media, and the detection of bots and spammers. However, few works have tried to identify the characteristics of a post that shares unreliable content and the associated behaviour of its account. This work presents four main contributions to this problem. First, we provide a methodology to build a large knowledge database of tweets that disseminate misinformation links. Then, we answer research questions on the data with the goal of bridging these problems to similar problems explored in the literature. Next, we focus on accounts which are constantly propagating misinformation links. Finally, based on the analysis conducted, we develop a model to detect social media accounts that spread unreliable content. Using Decision Trees, we achieved 96% in the F1-score metric, which supports the reliability of our approach.

Paper Nr: 23
Title:

Predicting Violent Behavior using Language Agnostic Models

Authors:

Yingjie Liu, Gregory Wert, Benjamin Greenawald, Mohammad Al Boni and Donald E. Brown

Abstract: Groups advocating violence have caused significant destruction to individuals and societies. To combat this, governmental and non-governmental organizations must quickly identify violent groups and limit their exposure. While some groups are well-known for their violence, smaller, less recognized groups are difficult to classify. However, using texts from these groups, we may be able to identify them. This paper applies text analysis techniques to differentiate violent and non-violent groups using discourses from various value-motivated groups. Significantly, the algorithms are constructed to be language-agnostic. The results show that deep learning models outperform traditional models. Our models achieve high accuracy when fairly trained only on data from other groups. Additionally, the results indicate that the models achieve better performance by removing groups with a large number of documents that can bias the classification. This study shows promise in using scalable, language-independent techniques to effectively identify violent value-motivated groups.

Paper Nr: 32
Title:

Iterated Algorithmic Bias in the Interactive Machine Learning Process of Information Filtering

Authors:

Wenlong Sun, Olfa Nasraoui and Patrick Shafto

Abstract: Early supervised machine learning (ML) algorithms have used reliable labels from experts to build predictions. But recently, these algorithms have been increasingly receiving data from the general population in the form of labels, annotations, etc. The result is that algorithms are subject to bias that is born from ingesting unchecked information, such as biased samples and biased labels. Furthermore, people and algorithms are increasingly engaged in interactive processes wherein neither the human nor the algorithms receive unbiased data. Algorithms can also make biased predictions, known as algorithmic bias. We investigate three forms of iterated algorithmic bias and how they affect the performance of machine learning algorithms. Using controlled experiments on synthetic data, we found that the three different iterated bias modes do affect the models learned by ML algorithms. We also found that Iterated filter bias, which is prominent in personalized user interfaces, can limit humans’ ability to discover relevant data.

Paper Nr: 53
Title:

Peer Assessment and Knowledge Discovering in a Community of Learners

Authors:

Maria De Marsico, Filippo Sciarrone, Andrea Sterbini and Marco Temperini

Abstract: Thanks to the exponential growth of the Internet, Distance Education is becoming more and more strategic in many fields of daily life. Its main advantage is that students can learn through appropriate web platforms that allow them to take advantage of multimedia and interactive teaching materials, without constraints of time or space. Today, in fact, the Internet offers many platforms suitable for this purpose, such as Moodle, ATutor and others. Coursera is another example of a platform that offers different courses to thousands of enrolled students. This approach to learning is, however, posing new problems, such as assessing the learning status of each learner when thousands of students follow a course, as in Massive Open Online Courses (MOOCs). Peer assessment can therefore be a solution to this problem: evaluation takes place between peers, creating a dynamic in the community of learners that evolves autonomously. In this article, we present a first step in this direction through a peer assessment mechanism led by the teacher, who intervenes by evaluating a very small part of the students. Through a mechanism based on machine learning, and in particular on a modified form of K-NN, given the teacher’s grades, the system should converge towards an evaluation that is as similar as possible to the one that the teacher would have given. An experiment is presented with encouraging results.

Paper Nr: 55
Title:

Cross-domain & In-domain Sentiment Analysis with Memory-based Deep Neural Networks

Authors:

Gianluca Moro, Andrea Pagliarani, Roberto Pasolini and Claudio Sartori

Abstract: Cross-domain sentiment classifiers aim to predict the polarity, namely the sentiment orientation of target text documents, by reusing a knowledge model learned from a different source domain. Distinct domains are typically heterogeneous in language, so that transfer learning techniques are advisable to support knowledge transfer from source to target. Distributed word representations are able to capture hidden word relationships without supervision, even across domains. Deep neural networks with memory (MemDNN) have recently achieved state-of-the-art performance in several NLP tasks, including cross-domain sentiment classification of large-scale data. The contribution of this work is extensive experimentation with novel MemDNN architectures, such as Gated Recurrent Unit (GRU) and Differentiable Neural Computer (DNC), in both cross-domain and in-domain sentiment classification, using the GloVe word embeddings. As far as we know, only GRU neural networks have been applied in cross-domain sentiment classification. Sentiment classifiers based on these deep learning architectures are also assessed from the viewpoint of scalability and accuracy by gradually increasing the training set size, and by showing the effect of fine-tuning, an explicit transfer learning mechanism, on cross-domain tasks. This work shows that MemDNN-based classifiers improve the state-of-the-art on the Amazon Reviews corpus with reference to document-level cross-domain sentiment classification. On the same corpus, DNC outperforms previous approaches in the analysis of a very large in-domain configuration in both binary and fine-grained document sentiment classification. Finally, DNC achieves accuracy comparable with the state-of-the-art approaches on the Stanford Sentiment Treebank dataset in both binary and fine-grained single-sentence sentiment classification.

Short Papers
Paper Nr: 4
Title:

Estimating the Value of Multiplayer Modes in Video Games: An Analysis of Sales, Ratings, and Utilization Rates

Authors:

Eric Nelson Bailey and Kazunori Miyata

Abstract: Video game organizations are under pressure from growing development costs and competition from other sources of entertainment. Making informed project scope decisions is critical to avoiding budgetary and schedule waste, but little information regarding best development practices is made public, and many decisions are made, not informed by past data, but by the tacit knowledge of project owners. One way developers have attempted to improve sales on their games and follow-up content is through the inclusion of online multiplayer functionality. However, multiplayer functionality is expensive to develop and can significantly add to the costs and schedule of a game project, particularly if the decision to include it is made late in the process. This research explores publicly available data to discover the value that multiplayer functionality provides so that project owners can make a more informed decision earlier in the development process.

Paper Nr: 5
Title:

A Method to Discover Digital Collaborative Conversations in Business Collaborations

Authors:

Antoine Flepp, Julie Dugdale, Fabrice Bourge and Tiphaine Marie-Cardot

Abstract: Many companies have a suite of digital tools, such as Enterprise Social Networks, conferencing and document sharing software, and email, to facilitate collaboration among employees. During, or at the end of, a collaboration, documents are often produced. People who were not involved in the initial collaboration often have difficulties understanding parts of its content because they lack the overall context. We argue there is valuable contextual and collaborative knowledge contained in these tools (content and use) that can be used to understand the document. Our goal is to rebuild the conversations that took place over a messaging service and their links with a digital conferencing tool during document production. The novelty in our approach is to combine several conversation-threading methods to identify interesting links between distinct conversations. Specifically, we combine header-field information with social, temporal and semantic proximities. Our findings suggest the messaging service and conferencing tool are used in a complementary way. The primary results confirm that combining different conversation threading approaches is effective for detecting and constructing conversation threads from distinct digital conversations concerning the same document.

Paper Nr: 14
Title:

Selecting Relevance Thresholds to Improve a Recommender System in a Parliamentary Setting

Authors:

Luis M. de Campos, Juan M. Fernández-Luna, Juan F. Huete and Luis Redondo-Expósito

Abstract: In the context of building a recommendation/filtering system to deliver relevant documents to Members of Parliament (MPs), we learn about their political interests by mining their parliamentary activity using supervised classification methods. The performance of the learned text classifiers, one for each MP, depends on a critical parameter, the relevance threshold. This threshold is compared with the numerical score returned by each classifier to decide whether the document being considered should be sent to the corresponding MP. In this paper we study several methods which try to estimate the best relevance threshold for each MP, in the sense of maximizing the system performance. Our proposals are experimentally tested with data from the regional Andalusian Parliament in Spain, more precisely using the textual transcriptions of the speeches of the MPs in this parliament.
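As a rough illustration of the thresholding step described in this abstract, the sketch below picks one relevance threshold for a single MP by trying each observed classifier score as a candidate and keeping the one with the best F1 on labelled validation data. The choice of F1 as the performance measure, the candidate grid, the function names and the toy scores are all assumptions for illustration; the abstract does not say which estimation methods the paper actually studies.

```python
def f1_at(scores, labels, threshold):
    """F1 when a document is delivered if its score is >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_threshold(scores, labels):
    """Try every observed score as a threshold and keep the best one."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at(scores, labels, t))


# Hypothetical validation data for one MP: classifier scores and
# whether each document was truly relevant to that MP.
scores = [0.1, 0.4, 0.35, 0.8, 0.7]
labels = [False, False, True, True, True]

threshold = best_threshold(scores, labels)
print(threshold, f1_at(scores, labels, threshold))
```

Running the same search separately on each MP's validation data would give one threshold per classifier, which matches the per-MP setting the abstract describes.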

Paper Nr: 16
Title:

Evaluating Better Document Representation in Clustering with Varying Complexity

Authors:

Stephen Bradshaw and Colm O’Riordan

Abstract: Micro blogging has become a very popular activity and the posts made by users can be a valuable source of information. Classifying this content accurately can be a challenging task because comments are typically short and on their own may lack context. Reddit is a very popular microblogging site whose popularity has seen a huge and consistent increase over the years. In this paper we propose using alternative but related Reddit threads to build language models that can be used to disambiguate the intended meaning of terms in a post. A related thread is one which is similar in content, often consisting of the same frequently occurring terms or phrases. We posit that threads of a similar nature use similar language and that the identification of related threads can be used as a source to add context to a post, enabling more accurate classification. In this paper, graphs are used to model the frequency and co-occurrence of terms. The terms of a document are mapped to nodes, and the co-occurrence of two terms is recorded as an edge weight. To show the robustness of our approach, we compare the performance of using related Reddit threads with the use of an external ontology, WordNet. We apply a number of evaluation metrics to the clusters created and show that in every instance, the use of alternative threads to improve document representations is better than the use of WordNet or standard augmented vector models. We apply this approach to increasingly harder environments to test its robustness. A tougher environment is one where the classifying algorithm has more than two categories to choose from when selecting the appropriate class.
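The graph model mentioned in this abstract (terms as nodes, co-occurrence as edge weights) can be sketched in a few lines. This is only an illustration under stated assumptions: two terms are taken to co-occur when they appear in the same post, tokenisation is a plain whitespace split, and the function name and sample posts are hypothetical. The abstract does not specify the co-occurrence window or preprocessing the authors used.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(documents):
    """Build term-frequency node weights and co-occurrence edge weights.

    Assumption: two terms co-occur if they appear in the same document.
    Edges are undirected, stored as alphabetically ordered term pairs.
    """
    node_freq = Counter()
    edge_weight = Counter()
    for doc in documents:
        terms = doc.lower().split()
        node_freq.update(terms)
        for a, b in combinations(sorted(set(terms)), 2):
            edge_weight[(a, b)] += 1
    return node_freq, edge_weight


# Hypothetical posts in which "python" is ambiguous.
posts = ["python snake bite", "python code bug", "code bug fix"]
nodes, edges = cooccurrence_graph(posts)
print(nodes["python"], edges[("bug", "code")])
```

A heavier edge between "bug" and "code" than between "python" and "snake" is the kind of signal such a graph could contribute when deciding which sense of a term a post intends.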

Paper Nr: 18
Title:

A Novel Framework to Represent Documents using a Semantically-grounded Graph Model

Authors:

Antonio M. Rinaldi and Cristiano Russo

Abstract: As an increasing number of text-based documents of growing complexity become available over the Internet, it is clear that handling such documents as they are, i.e. in their original natural-language format, is a daunting task for computers. Thus, methods and techniques have been developed and refined over the last decades to transform digital documents from the full text version to another suitable representation, making them easier to handle and thus helping users to get the right information with reduced algorithmic complexity. One of the most widespread solutions in document representation and retrieval has consisted in transforming the full text version into a vector, which describes the contents of the document in terms of occurrence patterns of words. Despite the wide adoption of this technique, some notable drawbacks were soon pointed out by the research community, mainly concerning the lack of semantics for the associated terms. In this work, we use WordNet as a generalist linguistic database in order to enrich, at a semantic level, the document representation by exploiting a label and properties based graph model, implemented in Neo4J. This work demonstrates how such a representation allows users to quickly recognize the document topics and lays the foundations for cross-document relatedness measures that go beyond the mere word-centric approach.

Paper Nr: 22
Title:

Defining Dynamic Indicators for Social Network Analysis: A Case Study in the Automotive Domain using Twitter

Authors:

Indira Lázara Lanza Cruz and Rafael Berlanga Llavori

Abstract: In this paper we present a framework based on Linked Open Data Infrastructures to perform analysis tasks in social networks based on dynamically defined indicators. Based on the typical stages of business intelligence models, which start from the definition of strategic goals to define relevant indicators (Key Performance Indicators), we propose a new scenario where the sources of information are the social networks. The fundamental contribution of this work is to provide a framework for easily specifying and monitoring social indicators based on the measures offered by the APIs of the most important social networks. The main novelty of this method is that all the involved data and information are represented and stored as Linked Data. In this work we demonstrate the benefits of using linked open data, especially for processing and publishing company-specific social metrics and indicators.

Paper Nr: 25
Title:

Theatrical Genre Prediction using Social Network Metrics

Authors:

Manisha Shukla, Susan Gauch and Lawrence Evalyn

Abstract: With the emergence of digitization, large text corpora are now available online which provide humanities scholars an opportunity to perform literary analysis leveraging the use of computational techniques. Almost no work has been done to study the ability of mathematical properties of network graphs to predict literary features. In this paper, we apply network theory concepts in the field of literature to explore correlations between the mathematical properties of the social networks of plays and the plays’ dramatic genre. Our goal is to find metrics which can distinguish between theatrical genres without needing to consider the specific vocabulary of the play. We generated character interaction networks of 36 Shakespeare plays and tried to differentiate plays based on social network features captured by the character network of each play. We were able to successfully predict the genre of Shakespeare’s plays with the help of social network metrics and hence establish that differences of dramatic genre are successfully captured by the local and global social network metrics of the plays. Since the technique is highly extensible, future work can be applied to larger groups of plays, including plays written by different authors, from different periods, or even in different languages.

Paper Nr: 29
Title:

LAST.FM Songs Database: A Database for Musical Genre Classification

Authors:

Paulo Sergio da Conceição Moreira and Denise Fukumi Tsunoda

Abstract: In this article, we explore the development of a new database for the classification of musical genres called LAST.FM songs database. We used Last FM to verify the most popular music of seven musical genres and, from this, we applied concepts and techniques for extracting audio signal characteristics to work out a new database for studies related to Music Information Retrieval. We extracted a set of 24 characteristics based on the analysis of the audio signal in the time and frequency domains. As a contribution, we present the LAST.FM songs database, a database that includes two Brazilian musical genres (MPB and Samba) that are rarely explored by other works and which present a strong similarity between them.

Paper Nr: 31
Title:

A Semantically Aware Explainable Recommender System using Asymmetric Matrix Factorization

Authors:

Mohammed Alshammari, Olfa Nasraoui and Behnoush Abdollahi

Abstract: Matrix factorization is an accurate collaborative filtering method for predicting user preferences. However, it is a black box system that lacks transparency, providing little information about both users and items in comparison with white box systems. White box systems can easily generate explanations, relying on the rich information foundation that these systems exploit in an explicit manner. However, the accuracy of recommendations is generally low. In this work, we take advantage of the Semantic Web in the process of building a black box model which can make recommendations that can be explained. Our experiments show that our proposed method succeeds in producing lower error rates and in explaining its outputs.

Paper Nr: 34
Title:

PrCP: Pre-recommendation Counter-Polarization

Authors:

Mahsa Badami, Olfa Nasraoui and Patrick Shafto

Abstract: Personalized recommender systems are commonly used to filter information in social media, and recommendations are derived by training machine learning algorithms on these data. It is thus important to understand how machine learning algorithms, especially recommender systems, behave in polarized environments. We investigate how filtering and discovering information are affected by using recommender systems. We study the phenomenon of polarization within the context of the users' interactions with a space of items and how this affects recommender systems. We then investigate the behavior of machine learning algorithms in such environments. Finally, we propose a new recommendation model based on Matrix Factorization for existing collaborative filtering recommender systems in order to combat over-specialization in polarized environments, with the aim of counteracting polarization in human-generated data and machine learning algorithms.

Paper Nr: 35
Title:

Survey on Sentiment Analysis and Natural Computing Algorithms

Authors:

Denise Fukumi Tsunoda and Alex Sebastião Constâncio

Abstract: Opinion Mining and Sentiment Analysis on Social Media is a focus for many researchers, who experiment with several different techniques to overcome the usual problems found in the area. On the other hand, the study field of Natural Computing offers a plethora of algorithms and strategies that can be used to tackle many kinds of problems. This paper presents a survey on the confluence of those two major themes, listing and summarizing the very recent efforts that use Natural Computing techniques in the specific field of Opinion Mining and Sentiment Analysis.

Paper Nr: 36
Title:

Glottal Attributes Extracted from Speech with Application to Emotion Driven Smart Systems

Authors:

Alexander Iliev and Peter L. Stanchev

Abstract: Any novel smart system development depends on human-computer interaction and is also dependent either directly or indirectly on the emotion of the user. In this paper we propose an idea for the development of a smart system using sentiment extraction from speech with possible application in various areas in our everyday life. Two different speech corpora were used for cross-validation with training and testing on each set. The system is text, content and gender independent. Emotions were extracted from both female and male speakers. The system is robust to external noise and can be implemented in areas such as entertainment, personalization, system automation, service industries, security, surveillance, and many more.

Paper Nr: 37
Title:

The Potential Use of Bioinspired Algorithms Applied in the Segmentation of Mammograms

Authors:

David González-Patiño, Yenny Villuendas-Rey and Amadeo J. Argüelles-Cruz

Abstract: In this article, we present the potential use of bioinspired algorithms for segmentation. Three bioinspired algorithms are compared with the Otsu method, an algorithm currently used to perform image segmentation. The vast majority of bioinspired algorithms were designed for optimization; in this work, however, they are adapted so that the function to be optimized is one that allows us to segment an image. The results of this work show that bioinspired algorithms are a good alternative for the task of segmentation in medical images, specifically mammography.

Paper Nr: 42
Title:

Fast Document Similarity Computations using GPGPU

Authors:

Parijat Shukla and Arun K. Somani

Abstract: Several Big Data problems involve computing similarities between entities, such as records and documents, in a timely manner. Recent studies indicate that similarity-based deduplication techniques are efficient for document databases. Delta encoding-like techniques are commonly leveraged to compute similarities between documents. Operational requirements dictate low latency constraints. Previous research does not consider parallel computing to deliver low-latency delta encoding solutions. This paper makes a two-fold contribution in the context of the delta encoding problem occurring in document databases: (1) we develop a parallel processing-based technique to compute similarities between documents, and (2) we design a GPU-based document cache framework to accelerate the performance of the delta encoding pipeline. We experiment with real datasets. We achieve a throughput of more than 500 similarity computations per millisecond. The similarity computation framework achieves a throughput in the range of 237-312 MB per second, which is up to 10X higher than that of hashing-based approaches.

Paper Nr: 45
Title:

Using Google Books Ngram in Detecting Linguistic Shifts over Time

Authors:

Alaa El-Ebshihy, Nagwa El-Makky and Khaled Nagi

Abstract: The availability of large historical corpora, such as Google Books Ngram, makes it possible to extract various meta information about the evolution of human languages. Together with advances in machine learning techniques, researchers have recently used these huge corpora to track cultural and linguistic shifts in words and terms over time. In this paper, we develop a new approach to quantitatively recognize semantic changes of words during the period between 1800 and 1990. We use the state-of-the-art FastText approach to construct word embeddings for the Google Books Ngram corpus for the decades within the time period 1800-1990. We use a time series analysis to identify words that have a statistically significant change in the period between 1900 and 1990. We conduct a performance evaluation study to compare our approach against related work and show that our system is more robust against morphological language variations.

Paper Nr: 46
Title:

A Data Mining Service for Non-Programmers

Authors:

Artur Pedroso, Bruno Leonel Lopes, Jaime Correia, Filipe Araujo, Jorge Cardoso and Rui Pedro Paiva

Abstract: With the emergence of Big Data, the scarcity of data scientists to analyse all the data being produced in different domains became evident. To train new data scientists faster, web applications providing data science practices without requiring programming skills can be a great help. However, some available web applications fall short in providing good data mining practices, especially for the assessment and selection of models. Thus, in this paper we describe a system, currently under development, that will support the construction of data mining processes while enforcing good data mining practices. The system will be available through a web UI and will follow a microservices architecture that is still being designed and tested. Preliminary usability tests were conducted with two groups of users to evaluate the envisioned concept for the creation of data mining processes. In these tests we observed a generally high level of user satisfaction. To assess the performance of the current system design, we ran tests in a public cloud where we observed interesting results that will guide us in new directions.

Posters
Paper Nr: 2
Title:

A Probabilistic-driven Ensemble Approach to Perform Event Classification in Intrusion Detection System

Authors:

Roberto Saia, Salvatore Carta and Diego Reforgiato Recupero

Abstract: Network services are now widespread and absolutely essential for every category of user, professional and non-professional. Such a scenario needs constant research activity aimed at ensuring the security of the involved services, so as to prevent any fraudulent exploitation of the related network resources. This is not a simple task, because new threats arise day by day, forcing the research community to face them by developing new specific countermeasures. The Intrusion Detection System (IDS) plays a central role in this scenario, as its main task is to detect intrusion attempts through an evaluation model designed to classify each new network event as normal or intrusion. This paper introduces a Probabilistic-Driven Ensemble (PDE) approach that operates by using several classification algorithms, whose effectiveness has been improved on the basis of a probabilistic criterion. A series of experiments, performed using real-world data, shows how such an approach outperforms the state-of-the-art competitors, proving its better capability to detect intrusion events compared with the canonical solutions.

Paper Nr: 3
Title:

Sentiment Classification using N-ary Tree-Structured Gated Recurrent Unit Networks

Authors:

Vasileios Tsakalos and Roberto Henriques

Abstract: Recurrent Neural Networks (RNNs) are a good way of modeling sequences. However, this type of Artificial Neural Network (ANN) has two major drawbacks: it is not good at capturing long-range connections and it is not robust to the vanishing gradient problem (Hochreiter, 1998). Fortunately, RNN variants have been invented that can deal with these problems, namely Gated Recurrent Unit (GRU) networks (Chung et al., 2014) (Gülçehre et al., 2013) and Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997). Many problems in Natural Language Processing can be approximated with a sequence model. However, it is known that the syntactic rules of natural language have a recursive structure (Socher et al., 2011b). Therefore, a Recursive Neural Network (Goller and Kuchler, 1996) can be a great alternative. Kai Sheng Tai (Tai et al., 2015) has come up with an architecture that brings the good properties of the LSTM to a Recursive Neural Network. In this report, we present another alternative, a Recursive Neural Network combined with GRU, which performs very similarly to the N-ary Tree-Structured LSTM on binary and fine-grained Sentiment Classification (on the Stanford Sentiment Treebank dataset) but trains faster.

Paper Nr: 12
Title:

A Hybrid Approach to Question-answering for a Banking Chatbot on Turkish: Extending Keywords with Embedding Vectors

Authors:

Enes Burak Dündar, Tolga Çekiç, Onur Deniz and Seçil Arslan

Abstract: In this paper, we propose a hybrid keyword and word embedding based question answering method for a Turkish chatbot in the banking domain. This hybrid model is supported by keywords that we determined by clustering customer messages sent to the bank's Facebook Messenger bot. Word embedding models are used to represent words in such a way that the similarity between words is meaningful. Keyword-based similarity calculation between two questions enhanced our chatbot system. To evaluate the performance of the proposed system, we created a test dataset that contains questions with their corresponding answers. Questions in the test set are paraphrased versions of those in our dataset.
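
As an illustration of the general idea only, the following minimal Python sketch combines a keyword-overlap score with an embedding-based cosine similarity. The abstract does not state how the two signals are combined, so the weighted sum, the `alpha` parameter, the averaging of word vectors, and all function names here are assumptions made for this example, not the authors' method.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def sentence_vector(tokens, embeddings):
    """Average the vectors of known tokens; None if no token has a vector."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def keyword_overlap(tokens_a, tokens_b, keywords):
    """Jaccard overlap restricted to the domain keyword set."""
    ka = set(tokens_a) & keywords
    kb = set(tokens_b) & keywords
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)

def hybrid_similarity(q1, q2, embeddings, keywords, alpha=0.5):
    """Weighted sum of embedding similarity and keyword overlap (assumed scheme)."""
    t1, t2 = q1.lower().split(), q2.lower().split()
    v1, v2 = sentence_vector(t1, embeddings), sentence_vector(t2, embeddings)
    emb_sim = cosine(v1, v2) if v1 is not None and v2 is not None else 0.0
    return alpha * emb_sim + (1 - alpha) * keyword_overlap(t1, t2, keywords)

# Toy data, invented for the example.
toy_embeddings = {"lost": [1.0, 0.0], "stolen": [1.0, 0.0], "card": [0.0, 1.0]}
toy_keywords = {"card"}
score = hybrid_similarity("lost card", "card stolen", toy_embeddings, toy_keywords)
```

With real data, the embeddings would come from a trained word embedding model and the keyword set from the clustering step the abstract mentions; neither is reproduced here.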

Paper Nr: 13
Title:

TS Artificial Neural Networks Classification: A Classification Approach based on Time & Signal

Authors:

Fatima Zahrae Ait Omar and Najib Belkhayat

Abstract: Artificial neural networks (ANNs) have become the state-of-the-art technique for tackling highly complex problems in AI due to their strong prediction and feature extraction abilities. Recent developments in this technology have broadened the jungle of existing ANN architectures and made the field less accessible to novices. It is increasingly difficult for beginners to categorize the architectures and pick the best-suited ones for their case study, which makes the need to summarize and classify them undeniable. Many previous classifications tried to meet this aim but failed to clarify the use case for each architecture. The aim of this paper is to provide guidance and a clear overview to beginners and non-experts and to help them choose the right architecture for their research without having to dig deeper into the field. The classification suggested for this purpose is performed according to two dimensions inspired by the brain’s perception of the outside world: the time scale upon which the data is collected and the nature of the signal.

Paper Nr: 19
Title:

Methodology for Text Classification using Manually Created Corpora-based Sentiment Dictionary

Authors:

Nina Rizun and Wojciech Waloszek

Abstract: This paper presents a methodology for Textual Content Classification based on a combination of algorithms: preliminary formation of a contextual framework for the texts in a particular problem area; manual creation of the Hierarchical Sentiment Dictionary (HSD) on the basis of a topically-oriented Corpus; and recognition of text tonality by using the HSD to analyse documents as collections of topically complete fragments (paragraphs). To verify the proposed methodology, a case study of a corpus of Polish-language film reviews was used. The main scientific contributions of this research are: the writing style of the analyzed text determines the possibility of adapting the Text Classification algorithms; the hierarchically-oriented structure of the HSD allows the classification process to be customized for qualitative recognition of text tonality in the context of individual paragraph topics; texts of Persuasive style are most often initially given a certain tonality by their authors. The tone expressed in the author's opinion affects the qualitative indicators of sentiment recognition. Negative emotions of the author usually reduce the level of vocabulary variability as well as the variety of topics raised in the document, but simultaneously increase the level of unpredictability of words contextually used with both positive and negative emotional coloring.

Paper Nr: 27
Title:

Investigating the Use of Semantic Relatedness at Document and Passage Level for Query Augmentation

Authors:

Ghulam Sarwar and Stephen Bradshaw

Abstract: This paper documents an approach that i) uses graphs to capture the semantic relatedness between terms in text and ii) augments queries with those terms deemed to be semantically related to the query terms. In building the graphs we use a relatively straightforward approach based on term locations; we investigate approaches that aid query improvement by capturing the semantic relatedness that is extracted at passage level as well as at the complete document level. Semantic relatedness between terms is represented on a graph, where the terms are stored as nodes and the strength of their connection is recorded as an edge weight. In this fashion, we record the degree of connection between terms and use this to suggest possible additional words for improving the precision of a query. We compare the results of both approaches to a traditional approach and present a number of experiments at passage and document level. Our findings are that the approaches investigated achieve a competitive standard against a well-known baseline.
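
The graph representation described in this abstract (terms as nodes, connection strength as edge weight) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the sliding co-occurrence window, the whitespace tokenisation, the scoring rule, and the names `build_term_graph` and `augment_query` are all assumptions made for the example.

```python
from collections import defaultdict

def build_term_graph(passages, window=3):
    """Build {term: {neighbour: weight}} from co-occurrence within a sliding window.

    Each term is linked to the next (window - 1) terms of the same passage;
    the edge weight counts how often the two terms co-occur this way.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for passage in passages:
        tokens = passage.lower().split()
        for i, term in enumerate(tokens):
            for other in tokens[i + 1 : i + window]:
                if other != term:
                    graph[term][other] += 1
                    graph[other][term] += 1
    return graph

def augment_query(query, graph, k=2):
    """Append the k terms most strongly connected to the query terms."""
    terms = query.lower().split()
    scores = defaultdict(int)
    for term in terms:
        for neighbour, weight in graph.get(term, {}).items():
            if neighbour not in terms:
                scores[neighbour] += weight
    # Sort by descending weight, breaking ties alphabetically for determinism.
    best = sorted(scores, key=lambda n: (-scores[n], n))[:k]
    return terms + best

# Toy passages, invented for the example.
passages = [
    "solar panel energy output",
    "solar energy storage",
    "wind energy output",
]
graph = build_term_graph(passages)
expanded = augment_query("solar", graph, k=1)
```

Building the graph per passage rather than per document only changes which text units are passed in as `passages`.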

Paper Nr: 28
Title:

Discovering Trends in Brand Interest through Topic Models

Authors:

Diana Lopes-Teixeira, Fernando Batista and Ricardo Ribeiro

Abstract: Topic Modeling is a well-known unsupervised learning technique used when dealing with text data. It is used to discover latent patterns, called topics, in a collection of documents (corpus). This technique provides a convenient way to retrieve information from unclassified and unstructured text. Topic Modeling has been used to track events, topics and trends in domains such as academia, public health, marketing and news. In this paper, we propose a framework for extracting topics from a large dataset of short messages, for brand interest tracking purposes. The framework consists of training LDA topic models for each brand using time intervals, and then applying the model to aggregated documents. Additionally, we present a set of preprocessing tasks that helped to improve the topic models and the corresponding outputs. The experiments demonstrate that topic modeling can successfully track people’s discussions on Social Networks even in massive datasets, and capture those topics sparked by real-life events.

Paper Nr: 30
Title:

MATESC: Metadata-Analytic Text Extractor and Section Classifier for Scientific Publications

Authors:

Maria F. De La Torre, Carlos A. Aguirre, BreAnn M. Anshutz and William H. Hsu

Abstract: This paper addresses the task of extracting free-text sections from scientific PDF documents, and specifically the problem of formatting disparity among different publications, by analysing their metadata. For the purpose of extracting procedural knowledge in the form of recipes from papers, and for the application domain of nanomaterial synthesis, we present Metadata-Analytic Text and Section Extractor (MATESC), a heuristic rule-based pattern analysis system for text extraction and section classification from scientific literature. MATESC extracts text spans and uses metadata features such as spatial layout location, font type, and font size to create grouped blocks of text and classify them into groups and subgroups based on rules that characterize specific paper sections. The main purpose of our tool is to facilitate information and semantic knowledge extraction across different domain topics and journal formats. We measure the accuracy of MATESC using string matching algorithms to compute alignment costs between each section extracted by our tool and manually-extracted sections. To test its transferability across domains, we measure its accuracy on papers that are relevant to the papers that were used to determine our rule-based methodology and also on random papers crawled from the web. In the future, we will use natural language processing to improve paragraph grouping and classification.

Paper Nr: 33
Title:

A Method for Plagiarism Detection over Academic Citation Networks

Authors:

Sidik Soleman and Atsushi Fujii

Abstract: In academic publication, citation has long been used to borrow ideas from another document and give credit to the authors of that document, whereas plagiarism, which does not give appropriate credit for a borrowed idea, has of late become problematic. Because plagiarism detection has been formulated as finding partial near-duplicates in response to a document for a suspected case of plagiarism, in this paper we propose a method to improve the similarity computation between text fragments. Our contribution is to formulate three document similarities based on citation and content analysis, and to combine them in our method. We also show the effectiveness of our method experimentally and discuss its advantages and limitations.

Paper Nr: 38
Title:

LCA Histogram Distance for Rooted Labeled Caterpillars

Authors:

Takuya Yoshino, Kohei Muraka and Kouichi Hirata

Abstract: An LCA histogram distance is an L1-distance between histograms consisting of triples of two nodes and their least common ancestor (LCA) in two trees. In this paper, we show that the LCA histogram distance for caterpillars is always a metric, whereas that for trees is not. Then, we give experimental results for computing the LCA histogram distance by comparing with the path histogram distance and the complete subtree histogram distance for caterpillars.
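
To make the definition concrete, here is a minimal Python sketch under one assumed reading of it: for every unordered pair of distinct nodes, the histogram counts the triple (label of one node, label of their LCA, label of the other node), with the two node labels sorted so that the pair order does not matter. The paper's exact definition may differ (for example in which node pairs are counted), and the parent-map tree encoding and function names are choices made for this example.

```python
from collections import Counter
from itertools import combinations

def ancestors(node, parent):
    """Path from node up to the root, with the node itself included."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path

def lca(u, v, parent):
    """Least common ancestor of u and v in a rooted tree given as a parent map."""
    seen = set(ancestors(u, parent))
    for a in ancestors(v, parent):
        if a in seen:
            return a
    raise ValueError("nodes are not in the same tree")

def lca_histogram(parent, label):
    """Count label triples (node, LCA, node) over all unordered node pairs."""
    hist = Counter()
    for u, v in combinations(parent, 2):
        first, second = sorted((label[u], label[v]))
        hist[(first, label[lca(u, v, parent)], second)] += 1
    return hist

def l1_distance(h1, h2):
    """L1 distance between two histograms (missing keys count as zero)."""
    return sum(abs(h1[key] - h2[key]) for key in h1.keys() | h2.keys())

# Two small rooted labeled trees: a root with two leaves, and a chain.
labels = {0: "a", 1: "b", 2: "c"}
star = {0: None, 1: 0, 2: 0}    # a is the parent of both b and c
chain = {0: None, 1: 0, 2: 1}   # a -> b -> c

distance = l1_distance(lca_histogram(star, labels), lca_histogram(chain, labels))
```

In this toy case the two histograms differ only in the triple for the pair (b, c), whose LCA is a in the first tree and b in the second.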

Paper Nr: 39
Title:

The RICHFIELDS Framework for Semantic Interoperability of Food Information Across Heterogenous Information Systems

Authors:

Tome Eftimov, Gordana Ispirova, Peter Korošec and Barbara Koroušić Seljak

Abstract: In the EU-funded project RICHFIELDS, a data platform was designed with the aim to collect, link and harmonize, analyze, store, and deliver food- and nutrition-related data and information to various stakeholders. To integrate heterogeneous food data sets, we propose a RICHFIELDS framework for semantic interoperability of food information, which is a combination of already developed NLP approaches for the food domain. The framework includes i) a food ontology to which foods are linked, ii) a part that explains how the relevant foods can be extracted and represented in a structured way, and iii) a similarity measure that is used to link the foods to the ontology. To evaluate the RICHFIELDS framework, we selected two distinct data sets from different food information systems. The experiments gave promising results, i.e., 81.5% and 87.5% of the foods from the first and the second data set, respectively, obtained a tag from the ontology (i.e., semantic annotation was performed). The annotations provided by the framework allow automatic integration of food information provided in both data sets.

Paper Nr: 47
Title:

Low Level Big Data Processing

Authors:

Jaime Salvador-Meneses, Zoila Ruiz-Chavez and Jose Garcia-Rodriguez

Abstract: Machine learning algorithms require that the information be stored in memory before they are applied. Reducing the amount of memory used for data representation clearly reduces the number of operations required to process it. Many current libraries represent the information in the traditional way, which forces iteration over the whole data set to obtain the desired result. In this paper we propose a technique to process categorical information previously encoded using the bit-level schema. The method uses block processing, which reduces the number of iterations over the original data while maintaining a processing performance similar to that of processing the original data. The method requires the information to be stored in memory, which makes it possible to optimize both the volume of memory consumed for representation and the operations required to process it. The experimental results show a slightly lower processing time than that obtained with traditional implementations, which allows us to obtain good performance.

Paper Nr: 48
Title:

Low Level Big Data Compression

Authors:

Jaime Salvador-Meneses, Zoila Ruiz-Chavez and Jose Garcia-Rodriguez

Abstract: In recent years, some specialized algorithms have been developed to work with categorical information. However, the performance of these algorithms depends on two important factors: the processing technique (algorithm) and the representation of the information used. Many machine learning algorithms depend on the information being stored in memory, local or distributed, prior to processing. Many current compression techniques do not achieve an adequate balance between compression ratio and decompression speed. In this work we propose a mechanism for storing and processing categorical information by compression at the bit level. The method compresses and decompresses by blocks, so that processing the compressed information resembles processing the original information. The proposed method allows the compressed data to be kept in memory, which drastically reduces memory consumption. The experimental results show a high compression ratio, while the block decompression is very efficient. Both factors contribute to building a system with good performance.
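
The abstract does not give the encoding details, so the following is only a generic Python sketch of bit-level packing of categorical codes by blocks, assuming each category has already been mapped to a small non-negative integer. The fixed bit width per value, the block size, and every function name are assumptions made for illustration, not the authors' scheme.

```python
def bits_needed(n_categories):
    """Bits required to represent the codes 0 .. n_categories - 1."""
    return max(1, (n_categories - 1).bit_length())

def pack_block(codes, width):
    """Pack a list of small integer codes into a single integer word."""
    word = 0
    for i, code in enumerate(codes):
        if code >> width:
            raise ValueError("code does not fit in the given bit width")
        word |= code << (i * width)
    return word

def unpack_block(word, width, count):
    """Recover `count` codes of `width` bits each from a packed word."""
    mask = (1 << width) - 1
    return [(word >> (i * width)) & mask for i in range(count)]

def compress(codes, n_categories, block_size=16):
    """Split the codes into blocks and pack each block separately."""
    width = bits_needed(n_categories)
    blocks = []
    for start in range(0, len(codes), block_size):
        chunk = codes[start : start + block_size]
        blocks.append((pack_block(chunk, width), len(chunk)))
    return width, blocks

def decompress(width, blocks):
    """Unpack the blocks one at a time, in their original order."""
    out = []
    for word, count in blocks:
        out.extend(unpack_block(word, width, count))
    return out

# Nine values drawn from four categories need 2 bits each.
codes = [0, 3, 1, 2, 2, 1, 0, 3, 3]
width, blocks = compress(codes, n_categories=4, block_size=4)
restored = decompress(width, blocks)
```

Because each block is unpacked independently, a consumer can process one block at a time without materialising the whole decoded column, which is the property the abstract attributes to block-wise decompression.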

Paper Nr: 50
Title:

Using Data Mining in a Mobile Application for the Calculation of the Female Fertile Period

Authors:

Francisco Vaz, Rodrigo Rocha Silva and Jorge Bernardino

Abstract: Many women need a better way to calculate the fertile period, since this calculation is important for knowing the best moments to have sexual intercourse either without pregnancy or with the intention of conceiving. This work describes the use of data mining in the development of a mobile application for the calculation of the female fertile period. The application contains the main functionalities needed, such as the entry of symptoms and moods each day and a calendar with daily events in which the user can see the risk of pregnancy and the ovulation day, among other features. We cover the necessary topics, such as the architecture, the data mining using the Random Forest algorithm, and some of the main functionalities. The application allows the sharing of information with doctors and/or partners as well as a prediction of the probability of delay for the next menstrual cycle. These two features are novel and are expected to drive the application's success through a greater number of downloads.

Paper Nr: 51
Title:

RDF Data Clustering based on Resource and Predicate Embeddings

Authors:

Siham Eddamiri, El Moukhtar Zemmouri and Asmaa Benghabrit

Abstract: With the increasing amount of Linked Data on the Web in the past decade, there is a growing desire in the machine learning community to bring this type of data into the fold. However, while Linked Data and Machine Learning have both seen explosive growth in popularity, relatively little attention has been paid in the literature to the possible union of the two. The best way to bring these two fields together is to focus on RDF data. After a thorough overview of the Machine Learning pipeline on RDF data, the paper presents an unsupervised feature extraction technique named Walks and two language modeling approaches, namely Word2vec and Doc2vec. In order to adapt the RDF graph to the clustering mechanism, we first applied the Walks technique to several sequences of entities by combining it with the Word2vec approach. However, the application of the Doc2vec approach to a set of walks gives better results on two different datasets.
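
The abstract does not define the Walks technique, so the sketch below only illustrates one common way to turn an RDF graph into token sequences: starting from an entity, follow outgoing triples and record the alternating entities and predicates. Enumerating all walks up to a fixed depth (rather than sampling them), the triple encoding, and the function names are assumptions made for this example.

```python
from collections import defaultdict

def build_adjacency(triples):
    """Map each subject to its outgoing (predicate, object) edges."""
    adjacency = defaultdict(list)
    for subject, predicate, obj in triples:
        adjacency[subject].append((predicate, obj))
    return adjacency

def walks_from(start, adjacency, depth):
    """All walks from `start` that follow up to `depth` edges.

    Each walk is a list alternating entities and predicates:
    [entity, predicate, entity, predicate, entity, ...].
    """
    walks = [[start]]
    for _ in range(depth):
        extended = []
        for walk in walks:
            edges = adjacency.get(walk[-1], [])
            if not edges:
                extended.append(walk)  # dead end: keep the shorter walk as it is
            for predicate, obj in edges:
                extended.append(walk + [predicate, obj])
        walks = extended
    return walks

# Toy triples, invented for the example.
triples = [
    ("alice", "knows", "bob"),
    ("bob", "worksAt", "acme"),
    ("alice", "livesIn", "paris"),
]
adjacency = build_adjacency(triples)
alice_walks = walks_from("alice", adjacency, depth=2)
```

Each walk can then be treated as a sentence for a Word2vec-style model, or the set of walks of one entity as a document for a Doc2vec-style model; the embedding training itself is not shown here.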

Paper Nr: 54
Title:

Towards a Technological Platform for Transparent and Flexible Assessment of Smart Cities

Authors:

Dessislava Petrova-Antonova, Sylvia Ilieva and Irena Pavlova

Abstract: The concept of smart cities is widely accepted as a powerful tool to improve living standards in all city dimensions. Smart cities aim to provide better quality services in the fields of health, transport, energy and education in order to increase the comfort of their citizens. Whether in the planning or implementation phase, a key success factor for building smart cities is measuring the productivity of the decisions and obtaining an assessment of the final results. Most cities recognise the smart city concept, many of them are working on strategies for its implementation, and more and more of them are taking concrete actions to deploy “smart” solutions. Two questions arise from this: “What are the challenges to become a smart city?” and “What does the city undertake to become smart?”. Answering them requires assessment of the city’s “smart services” and of the social effect of deploying “smart solutions” during the transformation from “smart” plan to “smart” process. In this context, this paper proposes an architecture for a technological platform for the assessment of a city’s “smartness”. Its primary goal is to provide a transparent and flexible indicator framework that supports quantitative progress evaluation of smart city strategy implementation, feedback on the efficiency of current policies, timely and informed decision making, and increased understanding of future city challenges. The main building components of the platform, namely the repository, web APIs and web user interface, are described. Additionally, a classification schema of indicators covering six main thematic areas is proposed.