Publications

We study the problem of index selection to maximize the workload performance, which is critical to database systems. In contrast to existing methods, we seamlessly integrate index recommendation rules and deep reinforcement learning, such that we can recommend single-attribute and multi-attribute indexes together for complex queries and meanwhile support multiple-index access to a table. Specifically, we first propose five heuristic rules to generate the index candidates. Then, we formulate the index selection problem as a reinforcement learning task and employ Deep Q Network (DQN) on it. Using the heuristic rules can significantly reduce the dimensions of the action space and state space in reinforcement learning. With the neural network used in DQN, we can model the interactions between indexes better than previous methods. We conduct experiments on various workloads to show its superiority.

Learning users’ preferences is critical to personalized search and recommendation. Most such systems depend on lists of items rank-ordered according to the user’s preference. Ideally, we want the system to adjust its estimate of users’ preferences after every interaction, thereby becoming progressively better at giving the user what she wants. We also want these adjustments to be gradual and explainable, so that the user is not surprised by wild swings in system rank ordering. In this paper, we support a rank-reversal operation on two items x and y for users - adjust the user’s preference such that the personalized rank of x and y is reversed. We emphasize that this problem is orthogonal to the preference learning and its solutions can run on top of the learning outcome of any vector-embedding-based preference learning model. Therefore, our preference adjustment techniques enable all those existing offline preference learning models to incrementally and interactively improve their response to (indirectly specified) user preferences. Specifically, we define the Minimum Dimension Adjustment (MDA) problem, where the preference adjustments are under certain constraints imposed by a specific graph and the goal is to adjust a user’s preference by reversing the personalized rank of two given items while minimizing the number of dimensions with value changed in the preference vector. We first prove that MDA is NP-hard, and then show that a 2.17-approximate solution can be obtained in polynomial time provided that an optimal solution to a carefully designed problem is given. Finally, we propose two efficient heuristic algorithms, where the first heuristic algorithm can achieve an approximation guarantee, and the second is provably efficient. Experiments on five publicly available datasets show that our solutions can adjust users’ preferences effectively and efficiently.

In this paper, we study the problem of large-scale trajectory data clustering, k-paths, which aims to efficiently identify k representative paths in a road network. Unlike traditional clustering approaches that require multiple data-dependent hyperparameters, k-paths can be used for visual exploration in applications such as traffic monitoring, public transit planning, and site selection. By combining map matching with an efficient intermediate representation of trajectories and a novel edge-based distance (EBD) measure, we present a scalable clustering method to solve k-paths. Experiments verify that we can cluster millions of taxi trajectories in less than one minute, achieving improvements of up to two orders of magnitude over state-of-the-art solutions that solve similar trajectory clustering problems.

Existing spatial object recommendation algorithms generally treat objects identically when ranking them. However, spatial objects often cover different levels of spatial granularity and thereby are heterogeneous. For example, one user may prefer to be recommended a region (say Manhattan), while another user might prefer a venue (say a restaurant). Even for the same user, preferences can change at different stages of data exploration. In this paper, we study how to support top-𝑘 spatial object recommendations at varying levels of spatial granularity, enabling spatial objects at have granularity, such as a city, suburb, or building, as a Point of Interest (POI). To solve this problem, we propose the use of a POI tree, which captures spatial containment relationships between POIs. We design a novel multi-task learning model called MPR (short for Multi-level POI Recommendation), where each task aims to return the top-𝑘 POIs at a certain spatial granularity level. Each task consists of two subtasks - (i) attribute-based representation learning; (ii) interaction-based representation learning. The first subtask learns the feature representations for both users and POIs, capturing attributes directly from their profiles. The second subtask incorporates user-POI interactions into the model. Additionally, MPR can provide insights into why certain recommendations are being made to a user based on three types of hints - user-aspect, POI-aspect, and interaction-aspect. We empirically validate our approach using two real-life datasets, and show promising performance improvements over several state-of-the-art methods.

Moving objects equipped with location-positioning devices continuously generate a large amount of spatio-temporal trajectory data. An interesting finding over a trajectory stream is a group of objects that are travelling together for a certain period of time. We observe that existing studies on mining co-moving objects do not consider an important correlation between co-moving objects, which is the reoccurrence of the co-moving pattern. In this study, we propose the problem of finding recurrent co-moving patterns from streaming trajectories, enabling us to discover recent co-moving patterns that are repeated within a given time period. Experimental results on real-life trajectory data verify the efficiency and effectiveness of our method.

In this paper, we study the problem of origin-destination (OD) travel time estimation where the OD input consists of an OD pair and a departure time. We propose a novel neural net- work based prediction model that fully exploits an important fact neglected by the literature – for a past OD trip its travel time is usually affiliated with the trajectory it travels along, whereas it does not exist during prediction. At the training phase, our goal is to design novel representations for the OD input and its affiliated trajectory, such that they are close to each other in the latent space. First, we match the OD pairs and their affiliated (historical) trajectories to road networks, and utilize road segment embeddings to represent their spa- tial properties. Later, we match the timestamps associated with trajectories to time slots and utilize time slot embed- dings to represent the temporal properties. Next, we build a temporal graph to capture the weekly and daily periodicity of time slot embeddings. Last, we design an effective encoding to represent the spatial and temporal properties of trajecto- ries. To bind each OD input to its affiliated trajectory, we also encode the OD input into a hidden representation, and make the hidden representation close to the spatio-temporal representation of the trajectory. At the prediction phase, we only use the OD input, get the hidden representation of the OD input, and use it to generate the travel time. Extensive experiments on real datasets show that our method achieves high effectiveness and outperforms existing methods.

In this modern era, traffic congestion has become a major source of negative economic and environmental impact for urban areas worldwide. One of the most efficient ways to mitigate traffic congestion is through future traffic prediction. The field of traffic prediction has evolved greatly ever since its inception in the late 70s. Earlier studies mainly use classical statistical models such as ARIMA and its variants. Then, researchers started to focus on machine learning models due to their power and flexibility. As theoretical and technological advances emerge, we enter the era of deep neural network, which gained popularity due to its sheer prediction power which can be attributed to the complex and deep structure. Despite the popularity of deep neural network models in the field of traffic prediction, literature surveys of them are rare. In this work, we present an up to date survey of deep neural network for traffic prediction. We will provide a detailed explanation of popular deep neural network architectures commonly used in the traffic flow prediction literatures, categorize and describe the literatures themselves, present an overview of the commonalities and differences between the different work, and finally provide a discussion regarding the challenges and future directions for this field.

Network embedding is an effective method to learn low-dimensional representations of nodes, which can be applied to various real-life applications such as visualization, node clas- sification, and link prediction. Although significant progress has been made on this problem in recent years, several important challenges remain, such as how to properly capture temporal information in evolving networks. In practice, most networks are continually evolving. Some networks only add new edges or nodes such as authorship networks, while others support removal of nodes or edges such as internet data routing. If patterns exist in the changes of the network structure, we can better understand the relationships between nodes and the evolution of the network, which can be further leveraged to learn node representations with more meaningful information. In this paper, we propose an Embedding via Historical Neighborhoods Aggregation (EHNA) method. More specifically, we first propose a temporal random walk which can identify relevant nodes in historical neighborhoods which have impact on edge formations. Then we present a deep learning model which uses custom atten- tion mechanisms to generate node embeddings to directly capture temporal information in the underlying feature representations. We conduct extensive experiments on real-world datasets, and the results demonstrate the effectiveness of our new approach in the network reconstruction and the link prediction tasks.

Detecting anomalous trajectory has become an im- portant and fundamental concern in many real-world applica- tions. However, most of the existing studies 1) cannot handle the complexity and variety of trajectory data and 2) do not support efficient anomaly detection in an online manner. To this end, we propose a novel model, namely Gaussian Mixture Variational Sequence AutoEncoder (GM-VSAE), to tackle these challenges. Our GM-VSAE model is able to (1) capture complex sequential information enclosed in trajectories, (2) discover different types of normal routes from trajectories and represent them in a continuous latent space, and (3) support efficient online detection via trajectory generation. Our experiments on two real-world datasets demonstrate that GM-VSAE is more effective than the state-of-the-art baselines and is efficient for online anomalous trajectory detection.

Knowledge bases (KBs) store rich yet heterogeneous entities and facts. Entity resolution (ER) aims to identify entities in KBs which refer to the same real-world object. Recent studies have shown significant benefits of involving humans in the loop of ER. They often resolve entities with pairwise similarity measures over attribute values and resort to the crowds to label uncertain ones. However, existing methods still suffer from high labor costs and insufficient labeling to some extent. In this paper, we propose a novel approach called crowdsourced collective ER, which leverages the relationships between entities to infer matches jointly rather than independently. Specifically, it iteratively asks human workers to label picked entity pairs and propagates the labeling information to their neighbors in distance. During this process, we address the problems of candidate entity pruning, probabilistic propagation, optimal question selection and error- tolerant truth inference. Our experiments on real-world datasets demonstrate that, compared with state-of-the-art methods, our approach achieves superior accuracy with much less labeling.

In this paper, we propose and study a variant of the dynamic ridesharing problem with a specific focus on peak hours. Given a set of drivers and a set of rider requests, we aim to match drivers to each rider request by achieving two objectives, maximizing the served rate and minimizing the total additional distance, subject to a series of spatio-temporal constraints. Our problem can be distinguished from existing ridesharing solutions in three aspects, (1) Previous work did not fully explore the impact of peak travel periods where the number of rider requests is much greater than the number of available drivers. (2) Existing ridesharing solutions usually rely on single objective optimization techniques, such as minimizing the total travel cost (either distance or time). (3) When evaluating the overall system performance, the runtime spent on updating drivers’ trip schedules as per newly coming rider requests should be incorporated, while it is unfortunately excluded by most existing solutions. In order to achieve our goal, we propose an underlying index structure on top of a partitioned road network, and compute the lower bounds of the shortest path distance between any two vertices. Using the proposed index together with a set of new pruning rules, we develop an efficient algorithm to dynamically include new riders directly into an existing trip schedule of a driver. In order to respond to new rider requests more effectively, we propose two algorithms that bilaterally match drivers with rider requests. Finally, we perform extensive experiments on a large-scale test collection to validate the effectiveness and efficiency of the proposed methods.

In this paper we propose and study the problem of optimizing the influence of outdoor advertising (ad) when impression counts are taken into consideration. Given a database U of billboards, each of which has a location and a non-uniform cost, a trajectory database T and a budget B, it aims to find a set of billboards that has the maximum influence under the budget. In line with the advertising consumer behavior studies, we adopt the logistic function to take into account the impression counts of an ad (placed at different billboards) to a user trajectory when defining the influence measurement. However, this poses two challenges - (1) our problem is NP-hard to approximate within a factor of O(|T|1-ε) for any ε>0 in polynomial time; (2) the influence measurement is non-submodular, which means a straightforward greedy approach is not applicable. Therefore, we propose a tangent line based algorithm to compute a submodular function to estimate the upper bound of influence. Henceforth, we introduce a branch-and-bound framework with a θ-termination condition, achieving θ2/(1 - 1/e) approximation ratio. However, this framework is time-consuming when |U| is huge. Thus, we further optimize it with a progressive pruning upper bound estimation approach which achieves θ2/(1 - 1/e - ε) approximation ratio and significantly decreases the running-time. We conduct the experiments on real-world billboard and trajectory datasets, and show that the proposed approaches outperform the baselines by 95% in effectiveness. Moreover, the optimized approach is around two orders of magnitude faster than the original framework.

Data analysts spend more than 80% of time on data cleaning and integration in the whole process of data analytics due to data errors and inconsistencies. Similarity-based query processing is an important way to tolerate the errors and inconsistencies. However, similarity-based query processing is rather costly and traditional database cannot afford such expensive requirement. In this paper, we develop a distributed in-memory similarity-based query processing system called Dima. Dima supports four core similarity operations, i.e., similarity selection, similarity join, top-k selection and top-k join. Dima extends SQL for users to easily invoke these similarity-based operations in their data analysis tasks. To avoid expensive data transmission in a distributed environment, we propose balance-aware signatures where two records are similar if they share common signatures, and we can adaptively select the signatures to balance the workload. Dima builds signature-based global indexes and local indexes to support similarity operations. Since Spark is one of the widely adopted distributed in-memory computing systems, we have seamlessly integrated Dima into Spark and developed effective query optimization techniques in Spark. To the best of our knowledge, this is the first full-fledged distributed in-memory system that can support complex similarity-based query processing on large-scale datasets. We have conducted extensive experiments on four real-world datasets. Experimental results show that Dima outperforms state-of-the-art studies by 1–3 orders of magnitude and has good scalability.

Influence maximization (IM) continues to be a key research problem in social networks. The goal is to find a small seed set of target users that have the greatest influence in the network under various stochastic diffusion models. While significant progress has been made on the IM problem in recent years, several interesting challenges remain. For example, social networks in reality are constantly evolving, and “important” users with the most influence also change over time. As a result, several recent studies have proposed approaches to update the seed set as the social networks evolve. However, this seed set is not guaranteed to be the best seed set over a period of time. In this paper we study the problem of Distinct Influence Maximization (DIM) where the goal is to identify a seed set of influencers who maximize the number of distinct users influenced over a predefined window of time. Our new approach allows social network providers to make fewer incremental changes to targeted advertising while still maximizing the coverage of the advertisements. It also provides finer grained control over service level agreements where a certain number of impressions for an advertisement must be displayed in a specific time period. We propose two different strategies HCS and VCS with novel graph compression techniques to solve this problem. Additionally, VCS can also be applied directly to the traditional IM problem. Extensive experiments on real-world datasets verify the efficiency, accuracy and scalability of our solutions on both the DIM and IM problems.

A multi-attribute trajectory consists of a sequence of time-stamped locations and a set of attributes that characterize diverse aspects of the corresponding moving object. In this paper, we study continuous range queries over multi-attribute trajectories. Such a query returns the objects whose attributes contain expected values and whose locations are always within a distance threshold to the query trajectory during the entire overlapping time period. To efficiently answer the query, an optimal method of partitioning the trajectories is proposed and an index structure is developed to support the combined search of spatio-temporal parameters and attribute values. We provide a general solution that is able to process multi-attribute trajectories as well as traditional trajectories without attributes. We carry out comprehensive experiments in a prototype database system to evaluate the efficiency and scalability of our designs. The experimental results show that our approach outperforms five alternative approaches by a factor of 5-50x on large datasets.

Understanding urban areas of interest (AOIs) is essential to decision making in various urban planning and exploration tasks. Such AOIs can be computed based on the geographic points that satisfy the user query. In this demo, we present an interactive visualization system of urban AOIs, supported by a parameter-free and efficient footprint method called AOI-shapes. Compared to state-of-the-art footprint methods, the proposed AOI-shapes (i) is parameter-free, (ii) is able to recognize multiple regions/outliers, (iii) can detect inner holes, and (iv) supports the incremental method. We demonstrate the effectiveness and efficiency of the proposed AOI-shapes based on a real-world real estate dataset in Australia.

In this paper, we would like to demonstrate an intelligent traffic analytics system called T4, which enables intelligent analytics over real-time and historical trajectories from vehicles. At the front end, we visualize the current traffic flow and result trajectories of different types of queries, as well as the histograms of traffic flow and traffic lights. At the back end, T4 is able to support multiple types of common queries over trajectories, with compact storage, efficient index and fast pruning algorithms. The output of those queries can be used for further monitoring and analytics purposes. Moreover, we train the deep models for traffic flow prediction and traffic light control to reduce traffic congestion

Given a set of billboards U (each with a location and a cost), a database of trajectories T and a budget L, find a set of billboards within the budget to influence the largest number of trajectories.

This paper presents a new trajectory search engine called Torch for querying road network trajectory data. Torch is able to efficiently process two types of typical queries (similarity search and Boolean search), and support a wide variety of trajectory similarity functions. Additionally, we propose a new similarity function LORS in Torch to measure the similarity in a more effective and efficient manner. Indexing and search in Torch works as follows. First, each raw vehicle trajectory is transformed to a set of road segments (edges) and a set of crossings (vertices) on the road network. Then a lightweight edge and vertex index called LEVI is built. Given a query, a filtering framework over LEVI is used to dynamically prune the trajectory search space based on the similarity measure imposed. Finally, the result set (ranked or Boolean) is returned. Extensive experiments on real trajectory datasets verify the effectiveness and efficiency of Torch.

In this paper, we demonstrate a trip planning system called TISP, which enables users’ interactive exploration of POIs and trajectories in their incremental trip planning. At the back end, TISP is able to support seven types of common queries over spatial-only, spatial-textual and textual-only data, based on our proposed unified indexing and search paradigm. At the front end, we propose novel visualization designs to present the result of different types of queries; our user-friendly interaction designs allow users to construct further queries without inputting any text.

Trajectory analytics can benefit many real-world applications, e.g., frequent trajectory based navigation systems, road planning, car pooling, and transportation optimizations. Existing algorithms focus on optimizing this problem in a single machine. However, the amount of trajectories exceeds the storage and processing capability of a single machine, and it calls for large-scale trajectory analytics in distributed environments. The distributed trajectory analytics faces challenges of data locality aware partitioning, load balance, easy-to-use interface, and versatility to support various trajectory similarity functions. To address these challenges, we propose a distributed in-memory trajectory analytics system DITA. We propose an effective partitioning method, global index and local index, to address the data locality problem. We devise cost-based techniques to balance the workload. We develop a filter-verification framework to improve the performance. Moreover, DITA can support most of existing similarity functions to quantify the similarity between trajectories. We integrate our framework seamlessly into Spark SQL, and make it support SQL and DataFrame API interfaces. We have conducted extensive experiments on real world datasets, and experimental results show that DITA outperforms existing distributed trajectory similarity search and join approaches significantly.

In this demonstration we present POIsam, a visualization system supporting the following desirable features - representativeness, visibility constraint, zooming consistency, and panning consistency. The first two constraints aim to efficiently select a small set of representative objects from the current region of user’s interest, and any two selected objects should not be too close to each other for users to distinguish in the limited space of a screen. One unique feature of POISam is that any similarity metrics can be plugged into POISam to meet the user’s specific needs in different scenarios. The latter two consistencies are fundamental challenges to efficiently update the selection result w.r.t. user’s zoom in, zoom out and panning operations when they interact with the map. POISam drops a common assumption from all previous work, i.e. the zoom levels and region cells are pre-defined and indexed, and objects are selected from such region cells at a particular zoom level rather than from user’s current region of interest which in most cases do not correspond to the pre-defined cells. It results in extra challenge as we need to do object selection via online computation. To our best knowledge, this is the first system that is able to meet all the four features to achieve an interactive visualization map exploration system.

With the proliferation of mobile devices, large collections of geospatial data are becoming available, such as geo-tagged photos. Map rendering systems play an important role in presenting such large geospatial datasets to end users. We propose that such systems should support the following desirable features - representativeness, visibility constraint, zooming consistency, and panning consistency. The first two constraints are fundamental challenges to a map exploration system, which aims to efficiently select a small set of representative objects from the current region of user’s interest, and any two selected objects should not be too close to each other for users to distinguish in the limited space of a screen. We formalize it as the Spatial Object Selection (sos) problem, prove that it is an NP-hard problem, and develop a novel approximation algorithm with performance guarantees. To further support interactive exploration of geospatial data on maps, we propose the Interactive sos (isos) problem, in which we enrich the sos problem with the zooming consistency and panning consistency constraints. The objective of isos is to provide seamless experience for end-users to interactively explore the data by navigating the map. We extend our algorithm for the sos problem to solve the isos problem, and propose a new strategy based on pre-fetching to significantly enhance the efficiency. Finally we have conducted extensive experiments to show the efficiency and scalability of our approach.

Trajectory analytics can benefit many real-world applications, e.g., frequent trajectory based navigation systems, road planning, car pooling, and transportation optimizations. Existing algorithms focus on optimizing this problem in a single machine. However, the amount of trajectories exceeds the storage and processing capability of a single machine, and it calls for large-scale trajectory analytics in distributed environments. The distributed trajectory analytics faces challenges of data locality aware partitioning, load balance, easy-to-use interface, and versatility to support various trajectory similarity functions. To address these challenges, we propose a distributed in-memory trajectory analytics system DITA. We propose an effective partitioning method, global index and local index, to address the data locality problem. We devise cost-based techniques to balance the workload. We develop a filter-verification framework to improve the performance. Moreover, DITA can support most of existing similarity functions to quantify the similarity between trajectories. We integrate our framework seamlessly into Spark SQL, and make it support SQL and DataFrame API interfaces. We have conducted extensive experiments on real world datasets, and experimental results show that DITA outperforms existing distributed trajectory similarity search and join approaches significantly.

Since the mid-2000s, everal indexing techniques have been proposed to efficiently answer top-k spatial-textual queries. However, all of these approaches focus on answering one query at a time. In contrast, how to design efficient algorithms that can exploit similarities between incoming queries to improve performance has received little attention. In this article, we study a series of efficient approaches to batch process multiple top-k spatial-textual queries concurrently. We carefully design a variety of indexing structures for the problem space by exploring the effect of prioritizing spatial and textual properties on system performance. Specifically, we present an efficient traversal method, SF-Sep, over an existing space-prioritized index structure. Then, we propose a new space-prioritized index structure, the MIR-Tree to support a filter-and-refine based technique, SF-Grp. To support the processing of text-intensive data, we propose an augmented, inverted indexing structure that can easily be added into existing text search engine architectures and a novel traversal method for batch processing of the queries. In all of these approaches, the goal is to improve the overall performance by sharing the I/O costs of similar queries. Finally, we demonstrate significant I/O savings in our algorithms over traditional approaches by extensive experiments on three real datasets and compare how properties of different datasets affect the performance. Many applications in streaming, micro-batching of continuous queries, and privacy-aware search can benefit from this line of work.

GPS enables mobile devices to continuously provide new opportunities to improve our daily lives. For example, the data collected in applications created by Uber or Public Transport Authorities can be used to plan transportation routes, estimate capacities, and proactively identify low coverage areas. In this paper, we study a new kind of query-Reverse k Nearest Neighbor Search over Trajectories (RkNNT), which can be used for route planning and capacity estimation. Given a set of existing routes D_R , a set of passenger transitions D_T , and a query route Q, an RkNNT query returns all transitions that take Q as one of its k nearest travel routes. To solve the problem, we first develop an index to handle dynamic trajectory updates, so that the most up-to-date transition data are available for answering an RkNNT query. Then we introduce a filter refinement framework for processing RkNNT queries using the proposed indexes. Next, we show how to use RkNNT to solve the optimal route planning problem MaxRkNNT (MinRkNNT), which is to search for the optimal route from a start location to an end location that could attract the maximum (or minimum) number of passengers based on a predefined travel distance threshold. Experiments on real datasets demonstrate the efficiency and scalability of our approaches. To the best of our knowledge, this is the first work to study the RkNNT problem for route planning.

In this paper, we present HomeSeeker, an interactive visual analytics system to serve users with different backgrounds of the local real estate market and meet different degrees of user requirements. As a result, HomeSeeker augments existing commercial systems to help users discover hidden patterns, link various location-centered data to the price, as well as explore, filter and compare the properties, in order to easily find their preferred properties. In particular, we make the following contributions - (1) We present a problem abstraction for designing visualizations that help home buyers analyse the real estate data. Specifically, our data abstraction integrates heterogeneous data from different channels into a location-centred integrated real estate dataset. (2) We propose an interactive visual analytic procedure to help less informed users gradually learn about the local real estate market, upon which users exploit this learned knowledge to specify their individual requirements in property seeking. (3) We propose a series of designs to visualize properties/suburbs in different dimensions and in different granularities. We have collected, integrated and cleaned last 10 year’s real estate sold records in Australia as well as their location-related education, facility and transportation profiles, to generate a real multi-dimensional data repository, and implemented a system prototype for public access. At last, we present case studies based on real-world datasets and real scenario to demonstrate the usefulness and effectiveness of our system.

The problem of optimal location selection based on reverse k nearest neighbor (RkNN) queries has been extensively studied in spatial databases. In this work, we present a related query, denoted as a Maximized Bichromatic Reverse Spatial Textual k Nearest Neighbor (MaxST) query, that finds an optimal location and a set of keywords for an object so that the object is a kNN object for as many users as possible. Such a query has many practical applications including advertisements, where the query is to find the location and the text contents to include in an advertisement so that it is relevant to the maximum number of users. The visibility of the advertisements also has an important role in the users’ interests. In this work, we address two instances of the spatial relevance when ranking items - (1) the Euclidean distance and (2) the visibility. We carefully design a series of index structures and approaches to answer the MaxST for both instances. Specifically, we present (1) the Grp- topk approach that requires the computation of the top-k objects for all of the users first and then applies various pruning techniques to find the optimal location and keywords; (2) the Indiv- U approach, where we use similarity estimations to avoid computing the top-k objects of the users that cannot be a final result; and (3) the Index- U approach where we propose a hierarchical index structure over the users to improve pruning. We show that the keyword selection component in MaxST queries is NP-hard and present both approximate and exact solutions for the problem.

Urban data (e.g., real estate data, crime data) often have multiple attributes which are highly geography-related. With the scale of data increases, directly visualizing millions of individual data points on top of a map would overwhelm users’ perceptual and cognitive capacity and lead to high latency when users interact with the data. In this demo, we present ConvexCubes, a system that supports interactive visualization of large-scale multidimensional urban data in a multi-granularity way. Comparing to state-of-theart visualization-driven data structures, it exploits real-world geographic semantics (e.g., country, state, city) rather than using gridbased aggregation. Instead of calculating everything on demand, ConvexCubes utilizes existing visualization results to efficiently support different kinds of user interactions, such as zooming & panning, filtering and granularity control

This paper studies the location-based web search and aims to build a unified processing paradigm for two purposes - (1) efficiently support each of the various types of location-based queries (kNN query,top-k spatial-textual query, etc.) on two major forms of geo-tagged data, i.e., spatial point data such as geo-tagged web documents, and spatial trajectory data such as a sequence of geo-tagged travel blogs by a user; (2) support interactive search to provide quick response for a query session, within which a user usually keeps refining her query by either issuing different query types or specifying different constraints (e.g., adding a keyword and/or location, changing the choice of k, etc.) until she finds the desired results. To achieve this goal, we first propose a general Top-k query called Monotone Aggregate Spatial Keyword query-MASK, which is able to cover most types of location-based web search. Next, we develop a unified indexing (called Textual-Grid-Point Inverted Index) and query processing paradigm (called ETAIL Algorithm) to answer a single MASK query efficiently. Furthermore, we extend ETAIL to provide interactive search for multiple queries within one query session, by exploiting the commonality of textual and/or spatial dimension among queries. Last, extensive experiments on four real datasets verify the robustness and efficiency of our approach.

Main memory index is built with the assumption that the RAM is sufficiently large to hold data. Due to the volatility and high unit price of main memory, indices under secondary memory such as SSD and HDD are widely used. However, the I/O operation with main memory is still the bottleneck for query efficiency. In this paper, we propose a self-tuning indexing scheme called Tide-tree for RAM/Disk-based hybrid storage system. Tide-tree aims to overcome the obstacles main memory and disk-based indices face, and performs like the tide to achieve a double-win in space and performance, which is self-adaptive with respect to the running environment. Particularly, Tide-tree delaminates the tree structure adaptively with high efficiency based on storage sense, and applies an effective self-tuning algorithm to dynamically load various nodes into main memory. We employ memory mapping technology to solve the persistent problem of main memory index, and improves the efficiency of data synchronism and pointer translation. To further enhance the independence of Tide-tree, we employ the index head and the level address table to manage the whole index. With the index head, three efficient operations are proposed, namely index rebuild, index load and range search. We have conducted extensive experiments to compare the Tide-tree with several state-of-the-art indices, and the results have validated the high efficiency, reusability and stability of Tide-tree.

Data analysts in industries spend more than 80% of time on data cleaning and integration in the whole process of data analytics due to data errors and inconsistencies. It calls for effective query processing techniques to tolerate the errors and inconsistencies. In this paper, we develop a distributed in-memory similarity-based query processing system called Dima. Dima supports two core similarity-based query operations, i.e., similarity search and similarity join. Dima extends the SQL programming interface for users to easily invoke these two operations in their data analysis jobs. To avoid expensive data transformation in a distributed environment, we design selectable signatures where two records approximately match if they share common signatures. More importantly, we can adaptively select the signatures to balance the workload. Dima builds signature-based global indexes and local indexes to support efficient similarity search and join. Since Spark is one of the widely adopted distributed in-memory computing systems, we have seamlessly integrated Dima into Spark and developed effective query optimization techniques in Spark. To the best of our knowledge, this is the first full-fledged distributed in-memory system that can support similarity-based query processing. We demonstrate our system in several scenarios, including entity matching, web table integration and query recommendation.

In this paper, we propose and study the problem of top-m rank aggregation of spatial objects in streaming queries, where, given a set of objects O, a stream of spatial queries (kNN or range), the goal is to report the m objects with the highest aggregate rank. The rank of an object with respect to an individual query is computed based on its distance from the query location, and the aggregate rank is computed from all of the individual rank orderings. In order to solve this problem, we show how to upper and lower bound the rank of an object for any unseen query. Then we propose an approximation solution to continuously monitor the top-m objects efficiently, for which we design an Inverted Rank File (IRF) index to guarantee the error bound of the solution. In particular, we propose the notion of safe ranking to determine whether the current result is still valid or not when new queries arrive, and propose the notion of validation objects to limit the number of objects to update in the top-m results. We also propose an exact solution for applications where an approximate solution is not sufficient. Last, we conduct extensive experiments to verify the efficiency and effectiveness of our solutions. This is a fundamental problem that draws inspiration from three different domains - rank aggregation, continuous queries and spatial databases, and the solution can be used to monitor the importance / popularity of spatial objects, which in turn can provide new analytical tools for spatial data.

We study a new type of spatial-textual trajectory search - the Exemplar Trajectory Query (ETQ), which specifies one or more places to visit, and descriptions of activities at each place. Our goal is to efficiently find the top-k trajectories by computing spatial and textual similarity at each point. The computational cost for pointwise matching is significantly higher than previous approaches. Therefore, we introduce an incremental pruning baseline and explore how to adaptively tune our approach, introducing a gap-based optimization and a novel twolevel threshold algorithm to improve efficiency. Our proposed methods support order-sensitive ETQ with a minor extension. Experiments on two datasets verify the efficiency and scalability of our proposed solution.

Energy efficiency of cloud computing has been given great attention more than ever before. One of the challenges is how to strike a balance between minimizing the energy consumption and meeting the quality of services such as satisfying performance and resource availability in a timely manner. Many studies based on the online migration technology attempt to move virtual machine from low utilization of hosts and then switch it off with the purpose of reducing energy consumption. In this paper, we aim to develop an adaptive task scheduling strategy. In particular, we first model the virtual machine energy from the perspective of the cloud task scheduling, then we propose a genetic algorithm to achieve adaptive regulations for different requirements of energy and performance in cloud tasks (E-PAGA). Then we design two types of the fitness function for choosing the next generation with different preferences on energy and performance. As a result, we can adaptively adjust the energy and performance target before assigning the task in cloud, which is able to meet various requirements from different users. From the extensive experiments, we pinpoint several important observations which are useful in configuring real cloud data centers - 1) we prove that guaranteeing the minimum total task time usually leads to low energy consumption to some extent; 2) we must pay the price of the sacrificed performance if only taking into account the energy optimization; 3) we come to the conclusion that there is always an optimal condition of energy-efficiency ratio in the cloud data center, and more importantly the specific conditions of the optimal energy-efficiency ratio can be obtained.