Spatial objects often come with textual information, for example Points of Interest (POIs) with their descriptions; such data are referred to as geo-textual data. To retrieve such data, spatial keyword queries, which take into account both spatial proximity and textual relevance, have been extensively studied. Existing indexes designed for spatial keyword queries are mostly built from the geo-textual data alone, without considering the distribution of queries already received. However, previous studies have shown that exploiting the known query distribution can improve the index structure for future query processing. In this paper, we propose WISK, a learned index for spatial keyword queries that self-adapts to minimize query cost for a given query workload. One key challenge is how to utilize both structured spatial attributes and unstructured textual information when learning the index. We first divide the data objects into partitions, aiming to minimize the processing cost of the given query workload. We prove that the partitioning problem is NP-hard and propose a machine learning model to find the optimal partitions. Then, to achieve more pruning power, we build a hierarchical structure over the generated partitions in a bottom-up manner using a reinforcement learning-based approach. We conduct extensive experiments on real-world datasets and query workloads with various distributions, and the results show that WISK outperforms all competitors, achieving up to 8x speedup in querying time with comparable storage overhead.
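To make the idea of workload-aware partitioning concrete, the following is a minimal Python sketch of how the cost of a partitioning could be scored against a query workload. It is an illustration under stated assumptions, not WISK's actual cost model or learning procedure: it assumes a Boolean range query that matches objects inside a rectangle containing at least one query keyword, it summarizes each partition by a bounding box and a keyword union, and it counts cost as the number of objects in partitions that cannot be pruned. All names (`GeoObject`, `Query`, `must_scan`, `workload_cost`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeoObject:
    x: float
    y: float
    keywords: frozenset


@dataclass(frozen=True)
class Query:
    # Axis-aligned query rectangle plus a keyword set (assumed query form).
    xmin: float
    ymin: float
    xmax: float
    ymax: float
    keywords: frozenset


def summarize(partition):
    """Return the bounding box and keyword union of a non-empty partition."""
    xs = [o.x for o in partition]
    ys = [o.y for o in partition]
    vocab = frozenset().union(*(o.keywords for o in partition))
    return (min(xs), min(ys), max(xs), max(ys)), vocab


def must_scan(partition, q):
    """A partition is pruned unless it overlaps the query rectangle
    and shares at least one keyword with the query."""
    (xmin, ymin, xmax, ymax), vocab = summarize(partition)
    overlaps = not (q.xmax < xmin or q.xmin > xmax or
                    q.ymax < ymin or q.ymin > ymax)
    return overlaps and bool(vocab & q.keywords)


def workload_cost(partitions, workload):
    """Assumed cost: total objects scanned over all queries in the workload."""
    return sum(len(p) for q in workload for p in partitions if must_scan(p, q))


# Toy data: two clusters of objects and two queries, one per cluster.
a = GeoObject(1, 1, frozenset({"cafe"}))
b = GeoObject(2, 2, frozenset({"bar"}))
c = GeoObject(8, 8, frozenset({"cafe"}))
d = GeoObject(9, 9, frozenset({"museum"}))
workload = [
    Query(0, 0, 3, 3, frozenset({"cafe"})),
    Query(7, 7, 10, 10, frozenset({"museum"})),
]

single_cost = workload_cost([[a, b, c, d]], workload)    # one big partition
split_cost = workload_cost([[a, b], [c, d]], workload)   # split by cluster
print(single_cost, split_cost)
```

On this toy input, splitting the objects by cluster halves the scanned-object count compared with a single partition, because each query can prune the partition it does not overlap. A learned approach would search over such partitionings to reduce this kind of cost; the search itself is not shown here.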