LIST: Learning to Index Spatio-Textual Data for Embedding based Spatial Keyword Queries

With the proliferation of spatio-textual data, Top-k KNN spatial keyword queries (TkQs), which return a list of objects based on a ranking function that considers both spatial and textual relevance, have found many real-life applications. To efficiently handle TkQs, many indexes have been developed, but the effectiveness of TkQs remains limited. To improve effectiveness, several deep learning models have recently been proposed, but they suffer from severe efficiency issues, and there are no efficient indexes specifically designed to accelerate the top-k search process for these deep learning models. To tackle these issues, we consider embedding based spatial keyword queries, which capture the semantic meaning of query keywords and object descriptions in two separate embeddings to evaluate textual relevance. Although various models can be used to generate these embeddings, no indexes have been specifically designed for such queries. To fill this gap, we propose LIST, a novel machine learning based Approximate Nearest Neighbor Search index that Learns to Index the Spatio-Textual data. LIST utilizes a new learning-to-cluster technique to group relevant queries and objects together while separating irrelevant queries and objects. There are two key challenges in building an effective and efficient index, i.e., the absence of high-quality labels and the unbalanced clustering results. We develop a novel pseudo-label generation technique to address these two challenges. Additionally, we introduce a learning based spatial relevance model that can integrate with various text relevance models to form a lightweight yet effective relevance model for reranking objects retrieved by LIST.
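
To make the query type concrete, the following is a minimal sketch of an embedding based spatial keyword query answered by exhaustive scan, i.e., the brute-force baseline that an index such as LIST is meant to avoid. It is not the LIST index itself. Everything here is an assumption made for illustration: the weighted-sum ranking function with weight `alpha`, the linear distance decay bounded by `d_max` (a stand-in for the learned spatial relevance model), the dot-product text relevance, and the toy two-dimensional embeddings.

```python
import heapq
import math


def dot(u, v):
    """Dot product of two equal-length embedding vectors."""
    return sum(a * b for a, b in zip(u, v))


def spatial_relevance(q_loc, o_loc, d_max):
    """Assumed linear decay: 1 at distance 0, 0 at distance >= d_max."""
    return 1.0 - min(math.dist(q_loc, o_loc), d_max) / d_max


def top_k(query, objects, k, alpha=0.5, d_max=10.0):
    """Score every object and return the k best as (score, id) pairs.

    The score is an assumed weighted sum of text relevance (dot product of
    the query and object embeddings) and spatial relevance.
    """
    scored = []
    for obj in objects:
        text_rel = dot(query["emb"], obj["emb"])
        spat_rel = spatial_relevance(query["loc"], obj["loc"], d_max)
        scored.append((alpha * text_rel + (1 - alpha) * spat_rel, obj["id"]))
    return heapq.nlargest(k, scored)


# Hypothetical objects: a location plus a precomputed description embedding.
objects = [
    {"id": "cafe_near", "loc": (0.0, 1.0), "emb": (1.0, 0.0)},
    {"id": "cafe_far", "loc": (3.0, 4.0), "emb": (1.0, 0.0)},
    {"id": "gym_near", "loc": (0.0, 0.0), "emb": (0.0, 1.0)},
]
# Hypothetical query: a location plus a precomputed keyword embedding.
query = {"loc": (0.0, 0.0), "emb": (1.0, 0.0)}

result = [obj_id for _, obj_id in top_k(query, objects, k=2)]
print(result)
```

Because the query and object embeddings are produced separately, the object embeddings can be computed once offline. The scan above still touches every object per query, which is the cost that clustering queries and objects into a small number of candidate groups is intended to reduce.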
