LIST: Learning to Index Spatio-Textual Data for Embedding based Spatial Keyword Queries

With the proliferation of spatio-textual data, Top-k KNN spatial keyword queries (TkQs), which return a list of objects based on a ranking function that evaluates both spatial and textual relevance, have found many real-life applications. Existing geo-textual indexes for TkQs use traditional retrieval models such as BM25 to compute textual relevance and usually rely on a simple linear function to compute spatial relevance, which limits their effectiveness. To improve effectiveness, several deep learning models have recently been proposed, but they suffer from severe efficiency issues. To the best of our knowledge, there are no efficient indexes specifically designed to accelerate the top-k search process for these deep learning models. To tackle these issues, we propose a novel technique, called LIST, which Learns to Index the Spatio-Textual data for answering embedding based spatial keyword queries. LIST features two novel components. First, we propose a lightweight and effective relevance model that is capable of learning both textual and spatial relevance. Second, we introduce a novel machine learning based Approximate Nearest Neighbor Search (ANNS) index, which uses a new learning-to-cluster technique to group relevant queries and objects together while separating irrelevant queries and objects. Two key challenges in building an effective and efficient index are the absence of high-quality labels and unbalanced clustering results. We develop a novel pseudo-label generation technique to address both challenges. Experimental results show that LIST significantly outperforms state-of-the-art methods on effectiveness, with improvements of up to 19.21% and 12.79% in terms of NDCG@1 and Recall@10, respectively, and is three orders of magnitude faster than the most effective baseline.
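To make the idea of a cluster-based ANNS index concrete, the following is a minimal sketch, not LIST's actual implementation. It assumes that queries and objects are already embedded as plain vectors and that relevance is a dot product. LIST learns its cluster assignment from pseudo-labels; this sketch swaps in simple nearest-centroid routing as a stand-in. All names (`build_index`, `search`, `n_probe`) are hypothetical.

```python
import heapq
from collections import defaultdict


def dot(a, b):
    """Dot product, used here as an assumed relevance score."""
    return sum(x * y for x, y in zip(a, b))


def nearest_centroids(vec, centroids, n_probe):
    """Ids of the n_probe centroids with the highest dot product to vec."""
    ranked = sorted(range(len(centroids)),
                    key=lambda c: dot(vec, centroids[c]),
                    reverse=True)
    return ranked[:n_probe]


def build_index(object_embs, centroids):
    """Assign every object to one cluster (stand-in for a learned assignment)."""
    index = defaultdict(list)
    for obj_id, emb in enumerate(object_embs):
        cluster = nearest_centroids(emb, centroids, 1)[0]
        index[cluster].append(obj_id)
    return index


def search(query_emb, object_embs, centroids, index, k, n_probe=1):
    """Route the query to n_probe clusters, then rank only their objects."""
    candidates = [obj_id
                  for c in nearest_centroids(query_emb, centroids, n_probe)
                  for obj_id in index.get(c, [])]
    return heapq.nlargest(k, candidates,
                          key=lambda o: dot(query_emb, object_embs[o]))


# Toy data: two clusters, four objects, one query (all values made up).
centroids = [[1.0, 0.0], [0.0, 1.0]]
object_embs = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.2, 0.7]]
index = build_index(object_embs, centroids)
top2 = search([1.0, 0.2], object_embs, centroids, index, k=2)
```

Only the objects in the probed clusters are scored, which is where the speed-up over exhaustive scoring comes from. The trade-off is that a relevant object placed in an unprobed cluster is missed, which is why the quality and balance of the clustering matter.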
