Spatial objects often come with textual information, such as Points of Interest (POIs) with their descriptions; such data are referred to as geo-textual data. To retrieve such data, spatial keyword queries, which take into account both spatial proximity and textual relevance, have been extensively studied. Existing indexes designed for spatial keyword queries are mostly built from the geo-textual data alone, without considering the distribution of queries already received. However, previous studies have shown that utilizing the known query distribution can improve the index structure for future query processing. In this paper, we propose WISK, a learned index for spatial keyword queries that self-adapts to minimize query costs for a given query workload. One key challenge is how to utilize both structured spatial attributes and unstructured textual information when learning the index. We first divide the data objects into partitions, aiming to minimize the processing costs of the given query workload. We prove the NP-hardness of the partitioning problem and propose a machine learning model to find the optimal partitions. Then, to achieve more pruning power, we build a hierarchical structure over the generated partitions in a bottom-up manner using a reinforcement learning-based approach. We conduct extensive experiments on real-world datasets and query workloads with various distributions, and the results show that WISK outperforms all competitors, achieving up to 8x speedup in query time with comparable storage overhead.
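To make the idea of a workload-aware partitioning concrete, the following is a minimal sketch in Python, not WISK's actual cost model or learning procedure. It assumes a simplified setting: a query is a rectangle plus a keyword set, an object matches if it lies inside the rectangle and shares at least one keyword, and the cost of a query is the number of objects in every partition that cannot be pruned by its bounding box or keyword summary. All names (`GeoObject`, `Query`, `query_cost`, `workload_cost`) are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeoObject:
    x: float
    y: float
    keywords: frozenset


@dataclass(frozen=True)
class Query:
    xmin: float
    ymin: float
    xmax: float
    ymax: float
    keywords: frozenset


def matches(obj, q):
    """Assumed semantics: inside the rectangle and sharing >= 1 keyword."""
    inside = q.xmin <= obj.x <= q.xmax and q.ymin <= obj.y <= q.ymax
    return inside and bool(obj.keywords & q.keywords)


def partition_summary(objs):
    """Bounding box and keyword union of a non-empty partition."""
    xs = [o.x for o in objs]
    ys = [o.y for o in objs]
    keywords = frozenset().union(*(o.keywords for o in objs))
    return (min(xs), min(ys), max(xs), max(ys)), keywords


def query_cost(partitions, q):
    """Simplified cost: objects scanned in partitions that cannot be pruned."""
    cost = 0
    for objs in partitions:
        (xmin, ymin, xmax, ymax), keywords = partition_summary(objs)
        overlaps = (xmin <= q.xmax and q.xmin <= xmax
                    and ymin <= q.ymax and q.ymin <= ymax)
        if overlaps and keywords & q.keywords:
            cost += len(objs)
    return cost


def workload_cost(partitions, workload):
    """Total cost of a partitioning over all queries in the workload."""
    return sum(query_cost(partitions, q) for q in workload)


# Toy data: two clusters of POIs with different keywords.
a = GeoObject(1, 1, frozenset({"cafe"}))
b = GeoObject(2, 2, frozenset({"bar"}))
c = GeoObject(8, 8, frozenset({"cafe"}))
d = GeoObject(9, 9, frozenset({"bar"}))

workload = [
    Query(0, 0, 3, 3, frozenset({"cafe"})),
    Query(7, 7, 10, 10, frozenset({"bar"})),
]

coarse = [[a, b, c, d]]        # one partition, no pruning possible
spatial = [[a, b], [c, d]]     # split by location only
fine = [[a], [b], [c], [d]]    # split by location and keyword

costs = {
    "coarse": workload_cost(coarse, workload),    # 8
    "spatial": workload_cost(spatial, workload),  # 4
    "fine": workload_cost(fine, workload),        # 2
}
print(costs)
```

Under this toy cost, partitions that separate objects by both location and keyword let more of the workload be pruned. A real index must also account for the overhead of inspecting many partitions, which this sketch ignores.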