In this vision paper, we propose a shift in perspective for improving the effectiveness of similarity search. Rather than focusing solely on improving data quality, in particular the quality of machine learning-generated embeddings, we advocate a more comprehensive approach that also improves the underlying search mechanisms. We highlight three novel avenues that call for a redefinition of the similarity search problem: exploiting implicit data structures and distributions, engaging users in an iterative feedback loop, and moving beyond a single query vector. These avenues have become relevant in emerging applications such as large language models, video clip retrieval, and data labeling. We discuss the research challenges these new problem settings pose and share insights from our preliminary findings.