Data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.

Personal Web Revisitation by Context and Content Keywords with Relevance Feedback

Introduction:
Getting back to previously viewed web pages is a common yet difficult task for users because of the large volume of personally accessed information on the web. This paper leverages the natural human recall process, which uses episodic and semantic memory cues, and presents a personal web revisitation technique called WebPagePrev that works through context and content keywords. Underlying techniques for the acquisition, storage, decay, and utilization of context and content memories for page re-finding are discussed. A relevance feedback mechanism is also included to adapt to each individual’s memory strength and revisitation habits. Our 6-month user study shows that: (1) Compared with the existing web revisitation tool Memento, the History List Searching method, and the Search Engine method, the proposed WebPagePrev delivers the best re-finding quality in finding rate (92.10%), average F1-measure (0.4318) and average rank error (0.3145). (2) Our dynamic management of context and content memories, including the decay and reinforcement strategy, can mimic users’ retrieval and recall mechanism. With relevance feedback, the finding rate of WebPagePrev increases by 9.82%, average F1-measure increases by 47.09%, and average rank error decreases by 19.44% compared to a stable memory management strategy. Among the time, location, and activity context factors in WebPagePrev, activity is the best recall cue, and context+content based re-finding delivers the best performance compared to context based re-finding and content based re-finding.

Reference IEEE paper:
“Personal Web Revisitation by Context and Content Keywords with Relevance Feedback”, IEEE Transactions on Knowledge and Data Engineering, 2017.

Unique ID: SBI1042

Domain: DATA MINING

PPRank: Economically Selecting Initial Users for Influence Maximization in Social Networks

Introduction:
This paper focuses on seeking a new heuristic scheme for the influence maximization problem in social networks: how to economically select a subset of individuals (so-called seeds) to trigger a large cascade of further adoptions of a new behavior based on a contagion process. Most existing works on seed selection assume that a constant number k of seeds can be selected, irrespective of each individual’s intrinsic susceptibility to being influenced (e.g., it may be costly to persuade some seeds to adopt a new behavior). In this paper, a price-performance-ratio inspired heuristic scheme, PPRank, is proposed, which investigates how to economically select seeds within a given budget while trying to maximize the diffusion process. Our paper’s contributions are threefold. First, we explicitly characterize each user with two distinct factors, the susceptibility of being influenced (SI) and the influential power (IP), which represents the ability to actively influence others, and we formulate users’ SIs and IPs according to their social relations; a convex price-demand curve-based model is then used to convert each user’s SI into a persuasion cost (PC), the cost of successfully making the individual adopt a new behavior. Second, a novel cost-effective selection scheme is proposed, which adopts both the price-performance ratio (PC-IP ratio) and the user’s IP as an integrated selection criterion and explicitly takes the overlapping effect into account. Finally, simulations using both artificially generated and real-trace network data illustrate that, under the same budgets, PPRank can achieve a larger diffusion range than other heuristic and brute-force greedy schemes that do not take users’ persuasion costs into account.
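The idea of selecting seeds within a budget can be illustrated with a small sketch. This is not the PPRank algorithm itself: it is a simplified greedy pass that ranks users by influential power per unit of persuasion cost and adds them while the budget allows, and it ignores the overlapping effect that the paper accounts for. The `User` class, the `select_seeds` function and all sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    influential_power: float  # IP (hypothetical score)
    persuasion_cost: float    # PC (hypothetical cost, must be > 0)


def select_seeds(users, budget):
    """Greedily pick users by IP/PC ratio until the budget is exhausted.

    Simplified illustration only: no overlap handling, no SI-to-PC model.
    """
    ranked = sorted(
        users,
        key=lambda u: u.influential_power / u.persuasion_cost,
        reverse=True,
    )
    seeds, spent = [], 0.0
    for user in ranked:
        # Skip users that no longer fit; cheaper ones later may still fit.
        if spent + user.persuasion_cost <= budget:
            seeds.append(user)
            spent += user.persuasion_cost
    return seeds, spent


users = [
    User("a", 10.0, 5.0),
    User("b", 6.0, 2.0),
    User("c", 3.0, 3.0),
    User("d", 8.0, 8.0),
]
seeds, spent = select_seeds(users, budget=8.0)
print([u.name for u in seeds], spent)
```

With these made-up values, user "d" has high influential power but is never chosen, because it would consume the entire budget on its own.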

Reference IEEE paper:
“PPRank: Economically Selecting Initial Users for Influence Maximization in Social Networks”, IEEE SYSTEMS JOURNAL, 2017.

Unique ID: SBI1043

Domain: DATA MINING

QDA: A Query Driven Approach to Entity Resolution

Introduction:
This paper addresses the problem of query-aware data cleaning in the context of a user query. In particular, we develop a novel Query-Driven Approach (QDA) that systematically exploits the semantics of the predicates in SQL-like selection queries to reduce the data cleaning overhead. The objective of QDA is to issue the minimum number of cleaning steps necessary to answer a given SQL-like selection query correctly. A comprehensive empirical evaluation shows that QDA significantly outperforms traditional entity resolution (ER) techniques, especially when the query is very selective.

Reference IEEE paper:
“QDA: A Query Driven Approach to Entity Resolution”, IEEE Transactions on Knowledge and Data Engineering, 2017.

Unique ID: SBI1044

Domain: DATA MINING

Query Expansion with Enriched User Profiles for Personalized Search Utilizing Folksonomy Data

Introduction:
Query expansion has been widely adopted in Web search as a way of tackling the ambiguity of queries. Personalized search utilizing folksonomy data has demonstrated an extreme vocabulary mismatch problem that requires even more effective query expansion methods. Co-occurrence statistics, tag-tag relationships and semantic matching approaches are among those favoured by previous research. However, user profiles which only contain a user’s past annotation information may not be enough to support the selection of expansion terms, especially for users with limited previous activity with the system. We propose a novel model to construct enriched user profiles with the help of an external corpus for personalized query expansion. Our model integrates the current state-of-the-art text representation learning framework, known as word embeddings, with topic models in two groups of pseudo-aligned documents. Based on user profiles, we build two novel query expansion techniques. These two techniques are based on topical weights-enhanced word embeddings, and the topical relevance between the query and the terms inside a user profile respectively. The results of an in-depth experimental evaluation, performed on two real-world datasets using different external corpora, show that our approach outperforms traditional techniques, including existing non-personalized and personalized query expansion methods.

Reference IEEE paper:
“Query Expansion with Enriched User Profiles for Personalized Search Utilizing Folksonomy Data”, IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2017.

Unique ID: SBI1045

Domain: DATA MINING

RAPARE: A Generic Strategy for Cold Start Rating Prediction Problem

Introduction:
In recent years, recommender systems have become indispensable components of many e-commerce websites. One of the major challenges that largely remains open is the cold-start problem, which can be viewed as a barrier that keeps cold-start users/items away from the existing ones. In this paper, we aim to break through this barrier for cold-start users/items with the assistance of existing ones. In particular, inspired by the classic Elo Rating System, which has been widely adopted in chess tournaments, we propose a novel rating comparison strategy (RAPARE) to learn the latent profiles of cold-start users/items. The centerpiece of RAPARE is to provide a fine-grained calibration of the latent profiles of cold-start users/items by exploring the differences between cold-start and existing users/items. As a generic strategy, it can be instantiated in existing recommender system methods. To reveal the capability of the RAPARE strategy, we instantiate it on two prevalent methods in recommender systems, i.e., matrix factorization based and neighborhood based collaborative filtering. Experimental evaluations on five real data sets validate the superiority of our approach over existing methods in the cold-start scenario.
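For readers unfamiliar with the Elo Rating System that the abstract cites as inspiration, the classic update rule can be sketched as follows. This is the standard chess formulation, not the RAPARE strategy; how RAPARE adapts the comparison idea to latent profiles is not described here. The function names and the choice of K = 32 are assumptions made for illustration.

```python
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated ratings after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k controls how strongly a single result moves the ratings.
    """
    expected_a = expected_score(rating_a, rating_b)
    expected_b = 1.0 - expected_a
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - expected_b)
    return new_a, new_b


# Two equally rated players; A wins.
new_a, new_b = elo_update(1500.0, 1500.0, score_a=1.0)
print(new_a, new_b)
```

Each rating moves in proportion to the gap between the observed and the expected result, so an upset produces a larger correction than an expected outcome.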

Reference IEEE paper:
“RAPARE: A Generic Strategy for Cold Start Rating Prediction Problem”, IEEE Transactions on Knowledge and Data Engineering, 2017.

Unique ID: SBI1046

Domain: DATA MINING

SociRank: Identifying and Ranking Prevalent News Topics Using Social Media Factors

Introduction:
Mass media sources, specifically the news media, have traditionally informed us of daily events. In modern times, social media services such as Twitter provide an enormous amount of user-generated data, which have great potential to contain informative news-related content. For these resources to be useful, we must find a way to filter noise and only capture the content that, based on its similarity to the news media, is considered valuable. However, even after noise is removed, information overload may still exist in the remaining data—hence, it is convenient to prioritize it for consumption. To achieve prioritization, information must be ranked in order of estimated importance considering three factors. First, the temporal prevalence of a particular topic in the news media is a factor of importance, and can be considered the media focus (MF) of a topic. Second, the temporal prevalence of the topic in social media indicates its user attention (UA). Last, the interaction between the social media users who mention this topic indicates the strength of the community discussing it, and can be regarded as the user interaction (UI) toward the topic. We propose an unsupervised framework—SociRank—which identifies news topics prevalent in both social media and the news media, and then ranks them by relevance using their degrees of MF, UA, and UI. Our experiments show that SociRank improves the quality and variety of automatically identified news topics.

Reference IEEE paper:
“SociRank: Identifying and Ranking Prevalent News Topics Using Social Media Factors”, IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS, 2017.

Unique ID: SBI1047

Domain: DATA MINING

Towards Real-Time, Country-Level Location Classification of Worldwide Tweets

Introduction:
The increase of interest in using social media as a source for research has motivated tackling the challenge of automatically geolocating tweets, given the lack of explicit location information in the majority of tweets. In contrast to much previous work that has focused on location classification of tweets restricted to a specific country, here we undertake the task in a broader context by classifying global tweets at the country level, which is so far unexplored in a real-time scenario. We analyse the extent to which a tweet’s country of origin can be determined by making use of eight tweet-inherent features for classification. Furthermore, we use two datasets, collected a year apart from each other, to analyse the extent to which a model trained from historical tweets can still be leveraged for classification of new tweets. With classification experiments on all 217 countries in our datasets, as well as on the top 25 countries, we offer some insights into the best use of tweet-inherent features for an accurate country-level classification of tweets. We find that the use of a single feature, such as the use of tweet content alone – the most widely used feature in previous work – leaves much to be desired. Choosing an appropriate combination of both tweet content and metadata can actually lead to substantial improvements of between 20% and 50%. We observe that tweet content, the user’s self-reported location and the user’s real name, all of which are inherent in a tweet and available in a real-time scenario, are particularly useful to determine the country of origin. We also experiment on the applicability of a model trained on historical tweets to classify new tweets, finding that the choice of a particular combination of features whose utility does not fade over time can actually lead to comparable performance, avoiding the need to retrain. 
However, the difficulty of achieving accurate classification increases slightly for countries with multiple commonalities, especially for English and Spanish speaking countries.

Reference IEEE paper:
“Towards Real-Time, Country-Level Location Classification of Worldwide Tweets”, IEEE Transactions on Knowledge and Data Engineering, 2017.

Unique ID: SBI1048

Domain: DATA MINING

Trajectory Community Discovery and Recommendation by Multisource Diffusion Modeling

Introduction:
In this paper, we detect communities from trajectories. Existing algorithms for trajectory clustering usually rely on simplex representation and a single proximity-related metric. Unfortunately, additional information markers (e.g., social interactions or semantics in the spatial layout) are ignored, leading to the inability to fully discover the communities in a trajectory database. This is especially true for human-generated trajectories, where additional fine-grained markers (e.g., movement velocity at certain locations, or the sequence of semantic spaces visited) are especially useful in capturing latent relationships among community members. To overcome this limitation, we propose TODMIS, a general framework for Trajectory-based cOmmunity Detection by diffusion modeling on Multiple Information Sources. TODMIS combines additional information with raw trajectory data and constructs the diffusion process on multiple similarity metrics. It also learns the consistent graph Laplacians by constructing the multi-modal diffusion process and optimizing the heat kernel coupling on each pair of similarity matrices from multiple information sources. Then, dense sub-graph detection is used to discover the set of distinct communities (including community size) on the coupled multi-graph representation. Finally, based on the community information, we propose a novel model for online recommendation. We evaluate TODMIS and our online recommendation methods using different real-life datasets. Experimental results demonstrate the effectiveness and efficiency of our methods.

Reference IEEE paper:
“Trajectory Community Discovery and Recommendation by Multi-source Diffusion Modeling”, IEEE Transactions on Knowledge and Data Engineering, 2017.

Unique ID: SBI1049

Domain: DATA MINING

User Centric Similarity Search

Introduction:

User preferences play a significant role in market analysis. In the database literature there has been extensive work on query primitives, such as the well known top-k query that can be used for the ranking of products based on the preferences customers have expressed. Still, the fundamental operation that evaluates the similarity between products is typically done ignoring these preferences. Instead products are depicted in a feature space based on their attributes and similarity is computed via traditional distance metrics on that space. In this work we utilize the rankings of the products based on the opinions of their customers in order to map the products in a user-centric space where similarity calculations are performed. We identify important properties of this mapping that result in upper and lower similarity bounds, which in turn permit us to utilize conventional multidimensional indexes on the original product space in order to perform these user-centric similarity computations. We show how interesting similarity calculations that are motivated by the commonly used range and nearest neighbor queries can be performed efficiently, while pruning significant parts of the data set based on the bounds we derive on the user-centric similarity of products.

Reference IEEE paper:

“User-Centric Similarity Search”, IEEE Transactions on Knowledge and Data Engineering, 2017.

Unique ID: SBI1052

Domain: DATA MINING

User Vitality Ranking and Prediction in Social Networking Services: a Dynamic Network Perspective

Introduction:

Social networking services are prevalent in many online communities such as Twitter.com and Weibo.com, where millions of users interact with each other every day. One interesting and important problem in social networking services is to rank users based on their vitality in a timely fashion. An accurate ranking list of user vitality could benefit many parties in social network services, such as ad providers and site operators. Although it is very promising to obtain a vitality-based ranking list of users, there are many technical challenges due to the large scale and dynamics of social networking data. In this paper, we propose a unique perspective to achieve this goal: quantifying user vitality by analyzing the dynamic interactions among users on social networks. Examples of social networks include, but are not limited to, social networks in microblog sites and academic collaboration networks. Intuitively, if a user has many interactions with his friends within a time period and most of his friends do not have many interactions with their friends in the same period, it is very likely that this user has high vitality. Based on this idea, we develop quantitative measurements for user vitality and propose our first algorithm for ranking users based on vitality. We further consider the mutual influence between users while computing the vitality measurements and propose a second ranking algorithm, which computes user vitality iteratively. In addition to user vitality ranking, we also introduce a vitality prediction problem, which is of great importance for many applications in social networking services. Along this line, we develop a customized prediction model to solve the vitality prediction problem. To evaluate the performance of our algorithms, we collect two dynamic social network data sets. The experimental results with both data sets clearly demonstrate the advantage of our ranking and prediction methods.

Reference IEEE paper:

“User Vitality Ranking and Prediction in Social Networking Services: a Dynamic Network Perspective”, IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2017.

Unique ID: SBI1051

Domain: DATA MINING

Understand Short Texts by Harvesting and Analyzing Semantic Knowledge

Introduction:
Understanding short texts is crucial to many applications, but challenges abound. First, short texts do not always observe the syntax of a written language. As a result, traditional natural language processing tools, ranging from part-of-speech tagging to dependency parsing, cannot be easily applied. Second, short texts usually do not contain sufficient statistical signals to support many state-of-the-art approaches for text mining such as topic modelling. Third, short texts are more ambiguous and noisy, and are generated in an enormous volume, which further increases the difficulty to handle them. We argue that semantic knowledge is required in order to better understand short texts. In this work, we build a prototype system for short text understanding which exploits semantic knowledge provided by a well-known knowledge base and automatically harvested from a web corpus. Our knowledge-intensive approaches disrupt traditional methods for tasks such as text segmentation, part-of-speech tagging, and concept labelling, in the sense that we focus on semantics in all these tasks. We conduct a comprehensive performance evaluation on real-life data. The results show that semantic knowledge is indispensable for short text understanding, and our knowledge-intensive approaches are both effective and efficient in discovering semantics of short texts.

Reference IEEE paper:
“Understand Short Texts by Harvesting and Analyzing Semantic Knowledge”, IEEE Transactions on Knowledge and Data Engineering, 2017.

Unique ID: SBI1050

Domain: DATA MINING

An Iterative Classification Scheme for Sanitizing Large-Scale Datasets

Introduction:
Cheap ubiquitous computing enables the collection of massive amounts of personal data in a wide variety of domains. Many organizations aim to share such data while obscuring features that could disclose personally identifiable information. Much of this data exhibits weak structure (e.g., text), such that machine learning approaches have been developed to detect and remove identifiers from it. While learning is never perfect, and relying on such approaches to sanitize data can leak sensitive information, a small risk is often acceptable. Our goal is to balance the value of published data and the risk of an adversary discovering leaked identifiers. We model data sanitization as a game between 1) a publisher who chooses a set of classifiers to apply to data and publishes only instances predicted as non-sensitive and 2) an attacker who combines machine learning and manual inspection to uncover leaked identifying information. We introduce a fast iterative greedy algorithm for the publisher that ensures a low utility for a resource-limited adversary. Moreover, using five text data sets we illustrate that our algorithm leaves virtually no automatically identifiable sensitive instances for a state-of-the-art learning algorithm, while sharing over 93% of the original data, and completes after at most 5 iterations.

Reference IEEE paper:
“An Iterative Classification Scheme for Sanitizing Large-Scale Datasets”, IEEE Transactions on Knowledge and Data Engineering, 2017.

Unique ID: SBI1028

Domain: DATA MINING

Analyzing Sentiments in One Go: A Supervised Joint Topic Modeling Approach

Introduction:
In this work, we focus on modeling user-generated review and overall rating pairs, and aim to identify semantic aspects and aspect-level sentiments from review data as well as to predict overall sentiments of reviews. We propose a novel probabilistic supervised joint aspect and sentiment model (SJASM) to deal with the problems in one go under a unified framework. SJASM represents each review document in the form of opinion pairs, and can simultaneously model aspect terms and corresponding opinion words of the review for hidden aspect and sentiment detection. It also leverages sentimental overall ratings, which often come with online reviews, as supervision data, and can infer the semantic aspects and aspect-level sentiments that are not only meaningful but also predictive of overall sentiments of reviews. Moreover, we develop an efficient inference method for parameter estimation of SJASM based on collapsed Gibbs sampling. We evaluate SJASM extensively on real-world review data, and experimental results demonstrate that the proposed model outperforms seven well-established baseline methods for sentiment analysis tasks.

Reference IEEE paper:
“Analyzing Sentiments in One Go: A Supervised Joint Topic Modeling Approach”, IEEE Transactions on Knowledge and Data Engineering, 2017.

Unique ID: SBI1029

Domain: DATA MINING

Collaborative Filtering-Based Recommendation of Online Social Voting

Introduction:
Social voting is an emerging new feature in online social networks. It poses unique challenges and opportunities for recommendation. In this paper, we develop a set of matrix factorization (MF) and nearest-neighbor (NN)-based recommender systems (RSs) that explore user social network and group affiliation information for social voting recommendation. Through experiments with real social voting traces, we demonstrate that social network and group affiliation information can significantly improve the accuracy of popularity-based voting recommendation, and social network information dominates group affiliation information in NN-based approaches. We also observe that social and group information is much more valuable to cold users than to heavy users. In our experiments, simple meta path based NN models outperform computation-intensive MF models in hot-voting recommendation, while users’ interests for non-hot votings can be better mined by MF models. We further propose a hybrid RS, bagging different single approaches to achieve the best top-k hit rate.

Reference IEEE paper:
“Collaborative Filtering-Based Recommendation of Online Social Voting”, IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS, 2017.

Unique ID: SBI1030

Domain: DATA MINING

Computing Semantic Similarity of Concepts in Knowledge Graphs

Introduction:
This paper presents a method for measuring the semantic similarity between concepts in Knowledge Graphs (KGs) such as WordNet and DBpedia. Previous work on semantic similarity methods has focused on either the structure of the semantic network between concepts (e.g., path length and depth) or only the Information Content (IC) of concepts. We propose a semantic similarity method, namely wpath, that combines these two approaches, using IC to weight the shortest path length between concepts. Conventional corpus-based IC is computed from the distributions of concepts over a textual corpus, which requires preparing a domain corpus containing annotated concepts and has a high computational cost. As instances are already extracted from textual corpora and annotated by concepts in KGs, graph-based IC is proposed to compute IC based on the distributions of concepts over instances. Through experiments performed on well-known word similarity datasets, we show that the wpath semantic similarity method produces a statistically significant improvement over other semantic similarity methods. Moreover, in a real category classification evaluation, the wpath method shows the best performance in terms of accuracy and F score.
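A minimal sketch of the idea of using IC to weight the shortest path length is given below. The abstract does not state the formula, so the form used here, 1 / (1 + length × k^IC(lcs)) with k in (0, 1], is an assumption, as are the toy taxonomy, the instance counts and every function name. IC is taken as the negative log of a concept's share of instances, in the spirit of the graph-based IC described above.

```python
import math

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "animal": None,
    "mammal": "animal",
    "bird": "animal",
    "dog": "mammal",
    "cat": "mammal",
}

# Hypothetical instance counts per concept, including descendants.
COUNTS = {"animal": 100, "mammal": 60, "bird": 40, "dog": 30, "cat": 30}
ROOT = "animal"


def ancestors(concept):
    """Return the chain from a concept up to the root, inclusive."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def lcs_and_path_length(c1, c2):
    """Least common subsumer and number of edges between c1 and c2."""
    chain1, chain2 = ancestors(c1), ancestors(c2)
    for steps1, node in enumerate(chain1):
        if node in chain2:
            return node, steps1 + chain2.index(node)
    raise ValueError("concepts do not share a root")


def information_content(concept):
    """IC as the negative log of the concept's share of all instances."""
    return -math.log(COUNTS[concept] / COUNTS[ROOT])


def wpath_similarity(c1, c2, k=0.8):
    """Path length weighted by the IC of the least common subsumer.

    Assumed form: 1 / (1 + length * k ** IC(lcs)), with 0 < k <= 1.
    """
    lcs, length = lcs_and_path_length(c1, c2)
    return 1.0 / (1.0 + length * k ** information_content(lcs))


print(wpath_similarity("dog", "cat"))
print(wpath_similarity("dog", "bird"))
```

In this toy setup, a more specific common ancestor (higher IC) shrinks the effective path length, so "dog" and "cat" score higher than "dog" and "bird" even before the difference in raw path length is considered.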

Reference IEEE paper:
“Computing Semantic Similarity of Concepts in Knowledge Graphs”, IEEE Transactions on Knowledge and Data Engineering, 2017.

Unique ID: SBI1031

Domain: DATA MINING

Detecting Stress Based on Social Interactions in Social Networks

Introduction:
Psychological stress is threatening people’s health. It is non-trivial to detect stress in time for proactive care. With the popularity of social media, people are used to sharing their daily activities and interacting with friends on social media platforms, making it feasible to leverage online social network data for stress detection. In this paper, we find that a user’s stress state is closely related to that of his/her friends in social media, and we employ a large-scale dataset from real-world social platforms to systematically study the correlation between users’ stress states and social interactions. We first define a set of stress-related textual, visual, and social attributes from various aspects, and then propose a novel hybrid model, a factor graph model combined with a Convolutional Neural Network, to leverage tweet content and social interaction information for stress detection. Experimental results show that the proposed model can improve the detection performance by 6-9% in F1-score. By further analyzing the social interaction data, we also discover several intriguing phenomena; for example, the number of social structures with sparse connections (i.e., with no delta connections) of stressed users is around 14% higher than that of non-stressed users, indicating that the social structure of stressed users’ friends tends to be less connected and less complicated than that of non-stressed users.

Reference IEEE paper:
“Detecting Stress Based on Social Interactions in Social Networks”, IEEE Transactions on Knowledge and Data Engineering, 2017.

Unique ID: SBI1032

Domain: DATA MINING

Dynamic Facet Ordering for Faceted Product Search Engines

Introduction:
Faceted browsing is widely used in Web shops and product comparison sites. In these cases, a fixed ordered list of facets is often employed. This approach suffers from two main issues. First, one needs to invest a significant amount of time to devise an effective list. Second, with a fixed list of facets, a facet can become useless if all products that match the query are associated with that particular facet. In this work, we present a framework for dynamic facet ordering in e-commerce. Based on measures for specificity and dispersion of facet values, the fully automated algorithm ranks those properties and facets on top that lead to a quick drill-down for any possible target product. In contrast to existing solutions, the framework addresses e-commerce specific aspects, such as the possibility of multiple clicks, the grouping of facets by their corresponding properties, and the abundance of numeric facets. In a large-scale simulation and user study, our approach was, in general, favorably compared to a facet list created by domain experts, a greedy approach as baseline, and a state-of-the-art entropy-based solution.

Reference IEEE paper:
“Dynamic Facet Ordering for Faceted Product Search Engines”, IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2017.

Unique ID: SBI1033

Domain: DATA MINING

Efficient Clue-based Route Search on Road Networks

Introduction:
With the advances in geo-positioning technologies and location-based services, it is nowadays quite common for road networks to have textual contents on the vertices. The problem of identifying an optimal route that covers a sequence of query keywords has been studied in recent years. However, in many practical scenarios, an optimal route might not always be desirable. For example, a personalized route query is issued by providing some clues that describe the spatial context between PoIs along the route, where the result can be far from the optimal one. Therefore, in this paper, we investigate the problem of clue-based route search (CRS), which allows a user to provide clues on keywords and spatial relationships. First, we propose a greedy algorithm and a dynamic programming algorithm as baselines. To improve efficiency, we develop a branch-and-bound algorithm that prunes unnecessary vertices in query processing. In order to quickly locate candidates, we propose an AB-tree that stores both the distance and keyword information in a tree structure. To further reduce the index size, we construct a PB-tree that uses a 2-hop label index to pinpoint the candidates. Extensive experiments verify the superiority of our algorithms and index structures.

Reference IEEE paper:
“Efficient Clue-based Route Search on Road Networks”, IEEE Transactions on Knowledge and Data Engineering, 2017.

Unique ID: SBI1034

Domain: DATA MINING

Efficient Keyword-aware Representative Travel Route Recommendation

Introduction:
With the popularity of social media (e.g., Facebook and Flickr), users can easily share their check-in records and photos during their trips. In view of the huge number of user historical mobility records in social media, we aim to discover travel experiences to facilitate trip planning. When planning a trip, users always have specific preferences regarding their trips. Instead of restricting users to limited query options such as locations, activities or time periods, we consider arbitrary text descriptions as keywords about personalized requirements. Moreover, a diverse and representative set of recommended travel routes is needed. Prior works have elaborated on mining and ranking existing routes from check-in data. To meet the need for automatic trip organization, we claim that more features of Places of Interest (POIs) should be extracted. Therefore, in this paper, we propose an efficient Keyword-aware Representative Travel Route framework that uses knowledge extraction from users’ historical mobility records and social interactions. Specifically, we have designed a keyword extraction module to classify the POI-related tags for effective matching with query keywords. We have further designed a route reconstruction algorithm to construct route candidates that fulfill the requirements. To provide fitting query results, we explore Representative Skyline concepts, that is, the Skyline routes which best describe the trade-offs among different POI features. To evaluate the effectiveness and efficiency of the proposed algorithms, we have conducted extensive experiments on real location-based social network datasets, and the experiment results show that our methods demonstrate good performance compared to state-of-the-art works.

Reference IEEE paper:
“Efficient Keyword-aware Representative Travel Route Recommendation”, IEEE Transactions on Knowledge and Data Engineering, 2017.

Unique ID -SBI1035

Domain: DATA MINING

Energy efficient query processing in Web Search Engines

Introduction:
Web search engines are composed of thousands of query processing nodes, i.e., servers dedicated to processing user queries. So many servers consume a significant amount of energy, mostly attributable to their CPUs, but they are necessary to ensure low latencies, since users expect sub-second response times (e.g., 500 ms). However, users can hardly notice response times that are faster than their expectations. Hence, we propose the Predictive Energy Saving Online Scheduling Algorithm (PESOS) to select the most appropriate CPU frequency to process a query on a per-core basis. PESOS aims to process queries by their deadlines and leverages high-level scheduling information to reduce the CPU energy consumption of a query processing node. PESOS bases its decisions on query efficiency predictors, which estimate the processing volume and processing time of a query. We experimentally evaluate PESOS on the TREC ClueWeb09B collection and the MSN2006 query log. Results show that PESOS can reduce the CPU energy consumption of a query processing node by up to ~48% compared to a system running at maximum CPU core frequency. PESOS also outperforms the best state-of-the-art competitor with a ~20% energy saving, while the competitor requires fine parameter tuning and may incur uncontrollable latency violations.
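
The core idea of the abstract above, choosing the lowest CPU frequency that still meets a query's deadline, can be sketched as follows. This is a minimal illustration and not the PESOS algorithm itself: the `pick_frequency` helper, the frequency list and the assumption that processing time scales inversely with frequency are all hypothetical simplifications.

```python
from typing import Sequence

def pick_frequency(predicted_ms_at_max: float, deadline_ms: float,
                   freqs_ghz: Sequence[float]) -> float:
    """Pick the lowest frequency whose estimated processing time meets the deadline.

    Assumption (not from the paper): processing time scales inversely
    with CPU frequency, so time(f) = time(f_max) * f_max / f.
    """
    f_max = max(freqs_ghz)
    for f in sorted(freqs_ghz):
        estimated_ms = predicted_ms_at_max * f_max / f
        if estimated_ms <= deadline_ms:
            return f
    # No frequency meets the deadline: run as fast as possible.
    return f_max

freqs = [1.2, 1.6, 2.0, 2.4]                  # hypothetical available core frequencies
relaxed = pick_frequency(200.0, 500.0, freqs)  # plenty of slack before the deadline
tight = pick_frequency(450.0, 500.0, freqs)    # almost no slack
late = pick_frequency(600.0, 500.0, freqs)     # deadline cannot be met
```

A query with ample slack runs at the lowest frequency, while a query close to its deadline is pushed to the maximum frequency.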

Reference IEEE paper:
“Energy-efficient Query Processing in Web Search Engines”, IEEE Transactions on Knowledge and Data Engineering, 2017.

Unique ID -SBI1036

Domain: DATA MINING

Generating Query Facets using Knowledge Bases

Introduction:
A query facet is a significant list of information nuggets that explains an underlying aspect of a query. Existing algorithms mine facets of a query by extracting frequent lists contained in top search results. The coverage of facets and facet items mined by such methods might be limited, because only a small number of search results are used. To solve this problem, we propose mining query facets by using knowledge bases, which contain high-quality structured data. Specifically, we first generate facets based on the properties of the entities that are contained in Freebase and correspond to the query. Second, we mine initial query facets from search results and then expand them by finding similar entities in Freebase. Experimental results show that our proposed method can significantly improve the coverage of facet items over the state-of-the-art algorithms.

Reference IEEE paper:
“Generating Query Facets using Knowledge Bases”, IEEE Transactions on Knowledge and Data Engineering, 2017.

Unique ID -SBI1037

Domain: DATA MINING

Influential Node Tracking on Dynamic Social Network: An Interchange Greedy Approach

Introduction:
As both social network structure and the strength of influence between individuals evolve constantly, it is necessary to track the influential nodes under a dynamic setting. To address this problem, we explore the Influential Node Tracking (INT) problem as an extension of the traditional Influence Maximization (IM) problem under dynamic social networks. While the Influence Maximization problem aims at identifying a set of k nodes to maximize the joint influence under one static network, the INT problem focuses on tracking a set of influential nodes that keeps maximizing the influence as the network evolves. Utilizing the smoothness of the evolution of the network structure, we propose an efficient algorithm, Upper Bound Interchange Greedy (UBI), and a variant, UBI+. Instead of constructing the seed set from scratch, we start from the influential seed set found previously and perform node replacement to improve the influence coverage. Furthermore, by using a fast update method that calculates the marginal gain of nodes, our algorithm can scale to dynamic social networks with millions of nodes. Empirical experiments on three real large-scale dynamic social networks show that UBI and its variant, UBI+, achieve better performance in terms of both influence coverage and running time.
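
The node-replacement idea described above can be sketched as a simple interchange heuristic: start from the seed set found on the previous snapshot and keep swapping one seed for one non-seed while the swap improves coverage. This is only an illustration under strong assumptions. Influence is approximated by one-hop coverage, and the upper-bound pruning and fast marginal-gain updates of UBI are not modelled. The `coverage` and `interchange` helpers and the toy graph are hypothetical.

```python
from typing import Dict, Set

def coverage(seeds: Set[str], reach: Dict[str, Set[str]]) -> int:
    """Number of nodes covered: the seeds plus the nodes they directly influence."""
    covered = set(seeds)
    for s in seeds:
        covered |= reach.get(s, set())
    return len(covered)

def interchange(prev_seeds: Set[str], reach: Dict[str, Set[str]]) -> Set[str]:
    """Improve the previous seed set by single-node swaps until no swap helps."""
    seeds = set(prev_seeds)
    improved = True
    while improved:
        improved = False
        best = coverage(seeds, reach)
        for out in sorted(seeds):
            for cand in sorted(set(reach) - seeds):
                trial = (seeds - {out}) | {cand}
                if coverage(trial, reach) > best:
                    seeds, improved = trial, True
                    break
            if improved:
                break
    return seeds

# Hypothetical new snapshot: node -> nodes it directly influences
snapshot = {
    "a": {"b"}, "b": set(),
    "c": {"d", "e", "f"}, "d": set(), "e": set(), "f": set(),
    "g": {"h"}, "h": set(),
}
new_seeds = interchange({"a", "g"}, snapshot)  # seeds carried over from the previous snapshot
```

Because each accepted swap strictly increases coverage, the loop terminates. In this toy graph the previous seed `a` is replaced by `c`, which reaches more nodes in the new snapshot.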

Reference IEEE paper:
“Influential Node Tracking on Dynamic Social Network: An Interchange Greedy Approach”, IEEE Transactions on Knowledge and Data Engineering, 2017.

Unique ID -SBI1038

Domain: DATA MINING

l-Injection: Toward Effective Collaborative Filtering Using Uninteresting Items

Introduction:
We develop a novel framework, named l-injection, to address the sparsity problem of recommender systems. By carefully injecting low values into a selected set of unrated user-item pairs in a user-item matrix, we demonstrate that top-N recommendation accuracies of various collaborative filtering (CF) techniques can be significantly and consistently improved. We first adopt the notion of pre-use preferences of users toward a vast amount of unrated items. Using this notion, we identify uninteresting items that have not been rated yet but are likely to receive low ratings from users, and selectively impute them as low values. As our proposed approach is method-agnostic, it can be easily applied to a variety of CF algorithms. Through comprehensive experiments with three real-life datasets (e.g., Movielens, Ciao, and Watcha), we demonstrate that our solution consistently and universally enhances the accuracies of existing CF algorithms (e.g., item-based CF, SVD-based CF, and SVD++) by 2.5 to 5 times on average. Furthermore, our solution improves the running time of those CF methods by 1.2 to 2.3 times when its setting produces the best accuracy.
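
The injection step described above can be illustrated with a small sketch. It assumes the pre-use preference scores are already available as input (how they are estimated is outside this sketch), and the `inject_low_values` helper, the `fraction` cut-off and the low value of 0.0 are hypothetical choices rather than the paper's settings.

```python
from typing import Dict, Optional

Ratings = Dict[str, Dict[str, Optional[float]]]

def inject_low_values(ratings: Ratings,
                      pre_use: Dict[str, Dict[str, float]],
                      fraction: float = 0.5,
                      low_value: float = 0.0) -> Ratings:
    """Fill the least-preferred unrated items of each user with a low rating.

    ratings: user -> item -> rating, with None for unrated items.
    pre_use: user -> item -> assumed pre-use preference score (higher = more interesting).
    """
    filled: Ratings = {}
    for user, items in ratings.items():
        unrated = [i for i, r in items.items() if r is None]
        # Least interesting unrated items come first.
        unrated.sort(key=lambda i: pre_use[user][i])
        uninteresting = set(unrated[: int(len(unrated) * fraction)])
        filled[user] = {
            i: (low_value if i in uninteresting else r) for i, r in items.items()
        }
    return filled

# Hypothetical data: one user, one real rating, four unrated items
ratings = {"u1": {"m1": 5.0, "m2": None, "m3": None, "m4": None, "m5": None}}
pre_use = {"u1": {"m2": 0.9, "m3": 0.1, "m4": 0.2, "m5": 0.8}}
augmented = inject_low_values(ratings, pre_use, fraction=0.5)
```

The augmented matrix can then be passed to any CF method, which is the sense in which the abstract calls the approach method-agnostic.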

Reference IEEE paper:
“l-Injection: Toward Effective Collaborative Filtering Using Uninteresting Items”, IEEE Transactions on Knowledge and Data Engineering, 2017.

Unique ID -SBI1039

Domain: DATA MINING

Mining Competitors from Large Unstructured Datasets

Introduction:
In any competitive business, success is based on the ability to make an item more appealing to customers than the competition. A number of questions arise in the context of this task: How do we formalize and quantify the competitiveness between two items? Who are the main competitors of a given item? What are the features of an item that most affect its competitiveness? Despite the impact and relevance of this problem to many domains, only a limited amount of work has been devoted toward an effective solution. In this paper, we present a formal definition of the competitiveness between two items, based on the market segments that they can both cover. Our evaluation of competitiveness utilizes customer reviews, an abundant source of information that is available in a wide range of domains. We present efficient methods for evaluating competitiveness in large review datasets and address the natural problem of finding the top-k competitors of a given item. Finally, we evaluate the quality of our results and the scalability of our approach using multiple datasets from different domains.
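
To make the notion of competitiveness via shared market segments more concrete, here is a simplified sketch. It assumes each segment is a set of required features with a weight, and that an item covers a segment only if it offers all of those features. The paper's formal definition may differ, and the hotel data, weights and helper names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Segments = Dict[FrozenSet[str], float]

def covers(item_features: Set[str], segment: FrozenSet[str]) -> bool:
    """An item covers a segment if it offers every feature the segment requires."""
    return segment <= item_features

def competitiveness(a: Set[str], b: Set[str], segments: Segments) -> float:
    """Weighted share of market segments that both items cover."""
    total = sum(segments.values())
    shared = sum(w for seg, w in segments.items() if covers(a, seg) and covers(b, seg))
    return shared / total if total else 0.0

def top_k_competitors(target: str, items: Dict[str, Set[str]],
                      segments: Segments, k: int) -> List[Tuple[str, float]]:
    """Rank all other items by their competitiveness with the target item."""
    scores = [
        (name, competitiveness(items[target], feats, segments))
        for name, feats in items.items()
        if name != target
    ]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:k]

# Hypothetical segments (required features -> share of customers)
segments = {
    frozenset({"wifi"}): 0.5,
    frozenset({"wifi", "pool"}): 0.3,
    frozenset({"spa"}): 0.2,
}
hotels = {
    "hotel_a": {"wifi", "pool"},
    "hotel_b": {"wifi", "spa"},
    "hotel_c": {"wifi", "pool", "spa"},
}
ranking = top_k_competitors("hotel_a", hotels, segments, k=2)
```

In this toy data `hotel_c` is the stronger competitor of `hotel_a`, because the two hotels jointly cover both the wifi-only and the wifi-plus-pool segments.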

Reference IEEE paper:
“Mining Competitors from Large Unstructured Datasets”, IEEE Transactions on Knowledge and Data Engineering, 2017.

Unique ID -SBI1040

Domain: DATA MINING

Modeling Information Diffusion over Social Networks for Temporal Dynamic Prediction

Introduction:
How to model the process of information diffusion in social networks is a critical research task. Although numerous attempts have been made to study it, few of them can simulate and predict the temporal dynamics of the diffusion process. To address this problem, we propose a novel information diffusion model (GT model), which considers the users in the network as intelligent agents. Each agent jointly considers all his interacting neighbours and calculates the payoffs of his different choices to make a strategic decision. We introduce the time factor into the user payoff, enabling the GT model to predict not only the behaviour of a user but also when he will perform the behaviour. Both the global influence and social influence are explored in the time-dependent payoff calculation, where a new social influence representation method is designed to fully capture the temporal dynamic properties of social influence between users. Experimental results on Sina Weibo and Flickr validate the effectiveness of our methods.

Reference IEEE paper:
“Modeling Information Diffusion over Social Networks for Temporal Dynamic Prediction”, IEEE Transactions on Knowledge and Data Engineering, 2017.

Unique ID -SBI1041

Domain: DATA MINING

Topic Rehotting Prediction in Online Social Networks

Topic rehotting prediction is a popular technique in social networks. Detecting hot topics is already common and can benefit many tasks, including topic recommendation and the guidance of public opinion. However, in some cases, people may want to know when to re-hot a topic, i.e., make the topic popular again. In this paper, we address this issue by introducing a temporal User Topic Participation (UTP) model, which models users’ behaviours of posting messages. The UTP model takes into account users’ interests, friend circles, and unexpected events in online social networks. It also considers the continuous temporal modelling of topics, since topics change continuously over time. Furthermore, a weighting scheme is proposed to smooth the fluctuations in topic re-hotting prediction. Finally, experimental results on real-world data sets demonstrate the effectiveness of our proposed models and topic re-hotting prediction methods.
