Publications

Data analysts spend more than 80% of their time on data cleaning and integration in the overall process of data analytics, due to data errors and inconsistencies. Similarity-based query processing is an important way to tolerate such errors and inconsistencies. However, similarity-based query processing is rather costly, and traditional databases cannot afford such expensive computation. In this paper, we develop a distributed in-memory similarity-based query processing system called Dima. Dima supports four core similarity operations, i.e., similarity selection, similarity join, top-k selection and top-k join, and extends SQL so that users can easily invoke these operations in their data analysis tasks. To avoid expensive data transmission in a distributed environment, we propose balance-aware signatures, where two records are similar if they share common signatures, and we can adaptively select the signatures to balance the workload. Dima builds signature-based global and local indexes to support similarity operations. Since Spark is one of the most widely adopted distributed in-memory computing systems, we have seamlessly integrated Dima into Spark and developed effective query optimization techniques for it. To the best of our knowledge, this is the first full-fledged distributed in-memory system that can support complex similarity-based query processing on large-scale datasets. We have conducted extensive experiments on four real-world datasets, and the results show that Dima outperforms state-of-the-art approaches by 1–3 orders of magnitude and has good scalability.
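The signature idea above - two records can only be similar if they share a common signature - can be illustrated with a minimal prefix-filtering sketch for Jaccard similarity. This is an illustrative assumption, not Dima's actual signature scheme or code; the function names and threshold are hypothetical:

```python
from math import ceil

def jaccard(r, s):
    r, s = set(r), set(s)
    return len(r & s) / len(r | s)

def prefix_signatures(record, threshold, order):
    """Prefix of a record's tokens under a global token order; two records
    can reach the Jaccard threshold only if their prefixes share a token."""
    tokens = sorted(set(record), key=order.get)
    prefix_len = len(tokens) - ceil(threshold * len(tokens)) + 1
    return tokens[:prefix_len]

def similarity_join(R, S, threshold):
    # A global token order by frequency (rarest first) sharpens pruning.
    freq = {}
    for rec in R + S:
        for t in set(rec):
            freq[t] = freq.get(t, 0) + 1
    order = {t: (f, t) for t, f in freq.items()}

    # Index S by its signatures, then probe with R's signatures.
    index = {}
    for j, s in enumerate(S):
        for sig in prefix_signatures(s, threshold, order):
            index.setdefault(sig, []).append(j)
    results = set()
    for i, r in enumerate(R):
        candidates = set()
        for sig in prefix_signatures(r, threshold, order):
            candidates.update(index.get(sig, []))
        for j in candidates:  # verify surviving candidates exactly
            if jaccard(r, S[j]) >= threshold:
                results.add((i, j))
    return results
```

In a distributed setting, partitioning by signature rather than by record is what avoids shipping whole records between nodes; balancing which signatures go where is the workload-balancing step the abstract refers to.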

Influence maximization (IM) continues to be a key research problem in social networks. The goal is to find a small seed set of target users who have the greatest influence in the network under various stochastic diffusion models. While significant progress has been made on the IM problem in recent years, several interesting challenges remain. For example, social networks in reality are constantly evolving, and “important” users with the most influence also change over time. As a result, several recent studies have proposed approaches to update the seed set as the social network evolves. However, this seed set is not guaranteed to be the best seed set over a period of time. In this paper we study the problem of Distinct Influence Maximization (DIM), where the goal is to identify a seed set of influencers who maximize the number of distinct users influenced over a predefined window of time. Our new approach allows social network providers to make fewer incremental changes to targeted advertising while still maximizing the coverage of the advertisements. It also provides finer-grained control over service level agreements, where a certain number of impressions for an advertisement must be displayed in a specific time period. We propose two different strategies, HCS and VCS, with novel graph compression techniques to solve this problem. Additionally, VCS can also be applied directly to the traditional IM problem. Extensive experiments on real-world datasets verify the efficiency, accuracy and scalability of our solutions on both the DIM and IM problems.

A multi-attribute trajectory consists of a sequence of time-stamped locations and a set of attributes that characterize diverse aspects of the corresponding moving object. In this paper, we study continuous range queries over multi-attribute trajectories. Such a query returns the objects whose attributes contain the expected values and whose locations are always within a distance threshold of the query trajectory during the entire overlapping time period. To efficiently answer the query, we propose an optimal method of partitioning the trajectories and develop an index structure to support the combined search over spatio-temporal parameters and attribute values. We provide a general solution that can process multi-attribute trajectories as well as traditional trajectories without attributes. We carry out comprehensive experiments in a prototype database system to evaluate the efficiency and scalability of our designs. The experimental results show that our approach outperforms five alternative approaches by a factor of 5-50 on large datasets.

Understanding urban areas of interest (AOIs) is essential to decision making in various urban planning and exploration tasks. Such AOIs can be computed from the geographic points that satisfy a user query. In this demo, we present an interactive visualization system for urban AOIs, supported by a parameter-free and efficient footprint method called AOI-shapes. Compared to state-of-the-art footprint methods, the proposed AOI-shapes (i) is parameter-free, (ii) can recognize multiple regions and outliers, (iii) can detect inner holes, and (iv) supports incremental updates. We demonstrate the effectiveness and efficiency of AOI-shapes on a real-world real estate dataset from Australia.

In this paper, we demonstrate an intelligent traffic analytics system called T4, which enables intelligent analytics over real-time and historical trajectories from vehicles. At the front end, we visualize the current traffic flow and the resulting trajectories of different types of queries, as well as histograms of traffic flow and traffic lights. At the back end, T4 supports multiple types of common queries over trajectories, with compact storage, efficient indexes and fast pruning algorithms. The output of these queries can be used for further monitoring and analytics purposes. Moreover, we train deep models for traffic flow prediction and traffic light control to reduce traffic congestion.

Given a set of billboards U (each with a location and a cost), a database of trajectories T, and a budget L, find a set of billboards within the budget that influences the largest number of trajectories.
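A common baseline for this kind of budgeted coverage problem is a cost-effectiveness greedy heuristic: repeatedly pick the affordable billboard with the best marginal influence per unit cost. The sketch below is an illustrative assumption for intuition, not the paper's algorithm; the data layout is hypothetical:

```python
def greedy_billboards(billboards, budget):
    """billboards maps an id to (cost, set of influenced trajectory ids).
    Greedily add the affordable billboard with the highest marginal gain
    (newly influenced trajectories) per unit cost, until none fits."""
    chosen, covered, spent = set(), set(), 0.0
    while True:
        best, best_ratio = None, 0.0
        for bid, (cost, influenced) in billboards.items():
            if bid in chosen or spent + cost > budget:
                continue
            gain = len(influenced - covered)  # only count new trajectories
            if cost > 0 and gain / cost > best_ratio:
                best, best_ratio = bid, gain / cost
        if best is None:  # nothing affordable improves coverage
            return chosen, len(covered)
        cost, influenced = billboards[best]
        chosen.add(best)
        spent += cost
        covered |= influenced
```

Because influence is a coverage function (submodular), greedy heuristics of this shape typically come with approximation guarantees, which is why they are the usual starting point for billboard-style selection problems.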

This paper presents a new trajectory search engine called Torch for querying road network trajectory data. Torch efficiently processes two typical query types (similarity search and Boolean search) and supports a wide variety of trajectory similarity functions. Additionally, we propose a new similarity function, LORS, in Torch to measure similarity more effectively and efficiently. Indexing and search in Torch work as follows. First, each raw vehicle trajectory is transformed into a set of road segments (edges) and a set of crossings (vertices) on the road network. Then a lightweight edge and vertex index called LEVI is built. Given a query, a filtering framework over LEVI dynamically prunes the trajectory search space based on the similarity measure imposed. Finally, the result set (ranked or Boolean) is returned. Extensive experiments on real trajectory datasets verify the effectiveness and efficiency of Torch.

In this paper, we demonstrate a trip planning system called TISP, which enables users’ interactive exploration of POIs and trajectories in their incremental trip planning. At the back end, TISP is able to support seven types of common queries over spatial-only, spatial-textual and textual-only data, based on our proposed unified indexing and search paradigm. At the front end, we propose novel visualization designs to present the result of different types of queries; our user-friendly interaction designs allow users to construct further queries without inputting any text.

Trajectory analytics can benefit many real-world applications, e.g., frequent-trajectory-based navigation systems, road planning, car pooling, and transportation optimization. Existing algorithms focus on optimizing this problem on a single machine. However, the volume of trajectory data often exceeds the storage and processing capability of a single machine, which calls for large-scale trajectory analytics in distributed environments. Distributed trajectory analytics faces the challenges of data-locality-aware partitioning, load balancing, an easy-to-use interface, and versatility in supporting various trajectory similarity functions. To address these challenges, we propose a distributed in-memory trajectory analytics system, DITA. We propose an effective partitioning method, with global and local indexes, to address the data locality problem. We devise cost-based techniques to balance the workload, and develop a filter-verification framework to improve performance. Moreover, DITA supports most existing similarity functions to quantify the similarity between trajectories. We integrate our framework seamlessly into Spark SQL, supporting both SQL and DataFrame API interfaces. We have conducted extensive experiments on real-world datasets, and the results show that DITA significantly outperforms existing distributed trajectory similarity search and join approaches.

In this demonstration we present POISam, a visualization system supporting the following desirable features - representativeness, visibility constraint, zooming consistency, and panning consistency. The first two constraints aim to efficiently select a small set of representative objects from the current region of the user's interest, such that no two selected objects are too close to each other for users to distinguish in the limited space of a screen. One unique feature of POISam is that any similarity metric can be plugged in to meet the user's specific needs in different scenarios. The latter two consistencies are fundamental challenges in efficiently updating the selection result w.r.t. the user's zoom-in, zoom-out and panning operations when they interact with the map. POISam drops a common assumption of all previous work, i.e., that zoom levels and region cells are pre-defined and indexed, and that objects are selected from such region cells at a particular zoom level rather than from the user's current region of interest, which in most cases does not correspond to the pre-defined cells. This poses an extra challenge, as object selection must be done via online computation. To the best of our knowledge, this is the first system that meets all four features to achieve an interactive visual map exploration system.

With the proliferation of mobile devices, large collections of geospatial data are becoming available, such as geo-tagged photos. Map rendering systems play an important role in presenting such large geospatial datasets to end users. We propose that such systems should support the following desirable features - representativeness, visibility constraint, zooming consistency, and panning consistency. The first two constraints are fundamental challenges to a map exploration system, which aims to efficiently select a small set of representative objects from the current region of the user's interest, such that no two selected objects are too close to each other for users to distinguish in the limited space of a screen. We formalize this as the Spatial Object Selection (SOS) problem, prove that it is NP-hard, and develop a novel approximation algorithm with performance guarantees. To further support interactive exploration of geospatial data on maps, we propose the Interactive SOS (iSOS) problem, in which we enrich the SOS problem with the zooming consistency and panning consistency constraints. The objective of iSOS is to provide a seamless experience for end users to interactively explore the data by navigating the map. We extend our algorithm for the SOS problem to solve the iSOS problem, and propose a new strategy based on pre-fetching to significantly enhance efficiency. Finally, we have conducted extensive experiments to show the efficiency and scalability of our approach.
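The object-selection step described above - pick important objects while keeping any two selected objects far enough apart to distinguish on screen - has a natural greedy baseline. The sketch below is a simplified illustration under assumed inputs (a per-object weight and a Euclidean minimum-distance constraint), not the paper's approximation algorithm:

```python
from math import hypot

def select_representatives(points, weights, min_dist, k):
    """Greedy sketch: repeatedly take the highest-weight point that is at
    least min_dist away from everything already chosen (the visibility
    constraint), stopping once k representatives are selected."""
    chosen = []
    for p in sorted(points, key=lambda q: -weights[q]):
        if all(hypot(p[0] - c[0], p[1] - c[1]) >= min_dist for c in chosen):
            chosen.append(p)
            if len(chosen) == k:
                break
    return chosen
```

For the interactive variant, zooming and panning change the region and the distance threshold, so the selection has to be updated incrementally rather than recomputed from scratch, which is where pre-fetching pays off.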

Since the mid-2000s, several indexing techniques have been proposed to efficiently answer top-k spatial-textual queries. However, all of these approaches focus on answering one query at a time. In contrast, how to design efficient algorithms that exploit similarities between incoming queries to improve performance has received little attention. In this article, we study a series of efficient approaches to batch process multiple top-k spatial-textual queries concurrently. We carefully design a variety of indexing structures for the problem space by exploring the effect of prioritizing spatial and textual properties on system performance. Specifically, we present an efficient traversal method, SF-Sep, over an existing space-prioritized index structure. Then, we propose a new space-prioritized index structure, the MIR-Tree, to support a filter-and-refine based technique, SF-Grp. To support the processing of text-intensive data, we propose an augmented inverted indexing structure that can easily be added to existing text search engine architectures, together with a novel traversal method for batch processing of the queries. In all of these approaches, the goal is to improve overall performance by sharing the I/O costs of similar queries. Finally, we demonstrate significant I/O savings of our algorithms over traditional approaches through extensive experiments on three real datasets, and compare how the properties of different datasets affect performance. Many applications in streaming, micro-batching of continuous queries, and privacy-aware search can benefit from this line of work.

GPS-equipped mobile devices continuously provide new opportunities to improve our daily lives. For example, the data collected in applications created by Uber or public transport authorities can be used to plan transportation routes, estimate capacities, and proactively identify low-coverage areas. In this paper, we study a new kind of query, Reverse k Nearest Neighbor Search over Trajectories (RkNNT), which can be used for route planning and capacity estimation. Given a set of existing routes D_R, a set of passenger transitions D_T, and a query route Q, an RkNNT query returns all transitions that take Q as one of their k nearest travel routes. To solve the problem, we first develop an index to handle dynamic trajectory updates, so that the most up-to-date transition data are available for answering an RkNNT query. Then we introduce a filter-refinement framework for processing RkNNT queries using the proposed indexes. Next, we show how to use RkNNT to solve the optimal route planning problem MaxRkNNT (MinRkNNT), which searches for the optimal route from a start location to an end location that attracts the maximum (or minimum) number of passengers under a predefined travel distance threshold. Experiments on real datasets demonstrate the efficiency and scalability of our approaches. To the best of our knowledge, this is the first work to study the RkNNT problem for route planning.
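The RkNNT definition can be made concrete with a brute-force check: a transition is in the answer if, among all routes plus the query route, the query ranks within its top k. The distance model below (each transition is an origin/destination pair, scored by the worse of the two endpoint-to-route distances, with routes treated as point sequences) is an assumed simplification for illustration, not the paper's exact formulation:

```python
from math import hypot

def route_distance(point, route):
    """Distance from a point to a route, taken as the minimum distance
    to any of the route's vertices (a simplifying assumption)."""
    return min(hypot(point[0] - x, point[1] - y) for x, y in route)

def rknnt(query_route, routes, transitions, k):
    """Return transitions (origin, destination) for which the query route
    is among the k nearest routes, scoring each route by the larger of
    the two endpoint distances."""
    result = []
    all_routes = routes + [query_route]
    for o, d in transitions:
        dists = sorted(max(route_distance(o, r), route_distance(d, r))
                       for r in all_routes)
        q_dist = max(route_distance(o, query_route),
                     route_distance(d, query_route))
        if q_dist <= dists[k - 1]:  # query ranks in this transition's top k
            result.append((o, d))
    return result
```

The indexes in the paper exist precisely to avoid this all-pairs distance computation; the brute force is only useful as a correctness reference.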

In this paper, we present HomeSeeker, an interactive visual analytics system that serves users with different backgrounds in the local real estate market and meets different degrees of user requirements. HomeSeeker augments existing commercial systems to help users discover hidden patterns, link various location-centred data to price, and explore, filter and compare properties, in order to easily find their preferred properties. In particular, we make the following contributions - (1) We present a problem abstraction for designing visualizations that help home buyers analyse real estate data. Specifically, our data abstraction integrates heterogeneous data from different channels into a location-centred integrated real estate dataset. (2) We propose an interactive visual analytics procedure that helps less-informed users gradually learn about the local real estate market, upon which users exploit this learned knowledge to specify their individual requirements in property seeking. (3) We propose a series of designs to visualize properties/suburbs in different dimensions and at different granularities. We have collected, integrated and cleaned the last 10 years of real estate sales records in Australia, as well as their location-related education, facility and transportation profiles, to generate a real multi-dimensional data repository, and have implemented a system prototype for public access. Finally, we present case studies based on real-world datasets and real scenarios to demonstrate the usefulness and effectiveness of our system.

The problem of optimal location selection based on reverse k nearest neighbor (RkNN) queries has been extensively studied in spatial databases. In this work, we present a related query, denoted as a Maximized Bichromatic Reverse Spatial Textual k Nearest Neighbor (MaxST) query, that finds an optimal location and a set of keywords for an object so that the object is a kNN object for as many users as possible. Such a query has many practical applications, including advertising, where the goal is to find the location and the text contents of an advertisement so that it is relevant to the maximum number of users. The visibility of the advertisements also plays an important role in the users' interest. In this work, we address two instances of spatial relevance when ranking items - (1) the Euclidean distance and (2) visibility. We carefully design a series of index structures and approaches to answer the MaxST query for both instances. Specifically, we present (1) the Grp-topk approach, which first computes the top-k objects for all of the users and then applies various pruning techniques to find the optimal location and keywords; (2) the Indiv-U approach, where we use similarity estimations to avoid computing the top-k objects of users that cannot be in a final result; and (3) the Index-U approach, where we propose a hierarchical index structure over the users to improve pruning. We show that the keyword selection component of MaxST queries is NP-hard and present both approximate and exact solutions for the problem.

Urban data (e.g., real estate data, crime data) often have multiple attributes that are highly geography-related. As the scale of the data increases, directly visualizing millions of individual data points on top of a map would overwhelm users' perceptual and cognitive capacity and lead to high latency when users interact with the data. In this demo, we present ConvexCubes, a system that supports interactive visualization of large-scale multidimensional urban data at multiple granularities. Compared to state-of-the-art visualization-driven data structures, it exploits real-world geographic semantics (e.g., country, state, city) rather than grid-based aggregation. Instead of calculating everything on demand, ConvexCubes utilizes existing visualization results to efficiently support different kinds of user interactions, such as zooming & panning, filtering and granularity control.

This paper studies location-based web search and aims to build a unified processing paradigm for two purposes - (1) to efficiently support each of the various types of location-based queries (kNN query, top-k spatial-textual query, etc.) on two major forms of geo-tagged data, i.e., spatial point data such as geo-tagged web documents, and spatial trajectory data such as a sequence of geo-tagged travel blogs by a user; (2) to support interactive search that provides quick responses within a query session, in which a user usually keeps refining her query by issuing different query types or specifying different constraints (e.g., adding a keyword and/or location, changing the choice of k, etc.) until she finds the desired results. To achieve this goal, we first propose a general top-k query called the Monotone Aggregate Spatial Keyword (MASK) query, which covers most types of location-based web search. Next, we develop a unified index (the Textual-Grid-Point Inverted Index) and query processing paradigm (the ETAIL algorithm) to answer a single MASK query efficiently. Furthermore, we extend ETAIL to provide interactive search for multiple queries within one query session, by exploiting the commonality of the textual and/or spatial dimensions among queries. Finally, extensive experiments on four real datasets verify the robustness and efficiency of our approach.

A main-memory index is built under the assumption that RAM is large enough to hold the data. Due to the volatility and high unit price of main memory, indices on secondary storage such as SSDs and HDDs are widely used. However, I/O between main memory and secondary storage remains the bottleneck for query efficiency. In this paper, we propose a self-tuning indexing scheme called Tide-tree for RAM/disk-based hybrid storage systems. Tide-tree aims to overcome the obstacles that main-memory and disk-based indices face, and performs like the tide to achieve a double win in space and performance, self-adapting to the running environment. In particular, Tide-tree delaminates the tree structure adaptively and efficiently based on storage awareness, and applies an effective self-tuning algorithm to dynamically load nodes into main memory. We employ memory-mapping technology to solve the persistence problem of main-memory indices and to improve the efficiency of data synchronization and pointer translation. To further enhance the independence of Tide-tree, we employ an index head and a level address table to manage the whole index. With the index head, three efficient operations are proposed: index rebuild, index load and range search. We have conducted extensive experiments comparing Tide-tree with several state-of-the-art indices, and the results validate the high efficiency, reusability and stability of Tide-tree.

Data analysts in industry spend more than 80% of their time on data cleaning and integration in the overall process of data analytics, due to data errors and inconsistencies. This calls for effective query processing techniques that tolerate such errors and inconsistencies. In this paper, we develop a distributed in-memory similarity-based query processing system called Dima. Dima supports two core similarity-based query operations, i.e., similarity search and similarity join, and extends the SQL programming interface for users to easily invoke these two operations in their data analysis jobs. To avoid expensive data transmission in a distributed environment, we design selectable signatures, where two records approximately match if they share common signatures. More importantly, we can adaptively select the signatures to balance the workload. Dima builds signature-based global and local indexes to support efficient similarity search and join. Since Spark is one of the most widely adopted distributed in-memory computing systems, we have seamlessly integrated Dima into Spark and developed effective query optimization techniques for it. To the best of our knowledge, this is the first full-fledged distributed in-memory system that can support similarity-based query processing. We demonstrate our system in several scenarios, including entity matching, web table integration and query recommendation.

In this paper, we propose and study the problem of top-m rank aggregation of spatial objects in streaming queries, where, given a set of objects O and a stream of spatial queries (kNN or range), the goal is to report the m objects with the highest aggregate rank. The rank of an object with respect to an individual query is computed based on its distance from the query location, and the aggregate rank is computed from all of the individual rank orderings. In order to solve this problem, we show how to upper- and lower-bound the rank of an object for any unseen query. Then we propose an approximate solution to continuously monitor the top-m objects efficiently, for which we design an Inverted Rank File (IRF) index to guarantee the error bound of the solution. In particular, we propose the notion of safe ranking, to determine whether the current result is still valid when new queries arrive, and the notion of validation objects, to limit the number of objects to update in the top-m results. We also propose an exact solution for applications where an approximate solution is not sufficient. Finally, we conduct extensive experiments to verify the efficiency and effectiveness of our solutions. This is a fundamental problem that draws inspiration from three different domains - rank aggregation, continuous queries and spatial databases - and the solution can be used to monitor the importance/popularity of spatial objects, which in turn provides new analytical tools for spatial data.
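Computing an aggregate rank from individual rank orderings can be illustrated with a simple Borda-style baseline: sum each object's positions across the per-query orderings and keep the m smallest totals. This scoring rule is an assumed illustration of rank aggregation in general, not the paper's IRF-based method:

```python
def aggregate_ranks(rankings, m):
    """rankings: list of per-query orderings (best first) over object ids.
    Sum each object's rank positions and return the m objects with the
    smallest aggregate rank, breaking ties by object id."""
    totals = {}
    for ranking in rankings:
        for pos, obj in enumerate(ranking):
            totals[obj] = totals.get(obj, 0) + pos
    return sorted(totals, key=lambda o: (totals[o], o))[:m]
```

In the streaming setting, the hard part is not this summation but deciding when a new query can change the top-m at all, which is what the safe-ranking and validation-object notions address.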

We study a new type of spatial-textual trajectory search - the Exemplar Trajectory Query (ETQ), which specifies one or more places to visit and a description of the activities at each place. Our goal is to efficiently find the top-k trajectories by computing spatial and textual similarity at each point. The computational cost of pointwise matching is significantly higher than in previous approaches. Therefore, we introduce an incremental pruning baseline and explore how to adaptively tune our approach, introducing a gap-based optimization and a novel two-level threshold algorithm to improve efficiency. Our proposed methods support order-sensitive ETQ with a minor extension. Experiments on two datasets verify the efficiency and scalability of our proposed solution.

Energy efficiency in cloud computing has received more attention than ever before. One key challenge is striking a balance between minimizing energy consumption and meeting quality of service, such as delivering satisfactory performance and resource availability in a timely manner. Many studies based on live migration attempt to move virtual machines away from under-utilized hosts and switch those hosts off to reduce energy consumption. In this paper, we develop an adaptive task scheduling strategy. In particular, we first model virtual machine energy from the perspective of cloud task scheduling, and then propose a genetic algorithm, E-PAGA, that adapts to the different energy and performance requirements of cloud tasks. We design two types of fitness function for choosing the next generation with different preferences on energy and performance. As a result, we can adaptively adjust the energy and performance targets before assigning tasks in the cloud, which meets the various requirements of different users. From extensive experiments, we pinpoint several important observations that are useful for configuring real cloud data centers - 1) guaranteeing the minimum total task time usually leads to low energy consumption to some extent; 2) we must pay the price of sacrificed performance if we consider only energy optimization; 3) there is always an optimal energy-efficiency ratio in a cloud data center, and more importantly, the specific conditions for this optimal ratio can be obtained.