Late-interaction retrieval models such as ColBERT achieve superior accuracy by enabling token-level interactions, but their computational cost hinders scalability and integration with Approximate Nearest Neighbor Search (ANNS). We introduce FastLane, a novel retrieval framework that dynamically routes queries to their most informative representations, eliminating redundant token comparisons. FastLane employs a learnable routing mechanism optimized jointly with the embedding model, leveraging self-attention and differentiable selection to maximize efficiency. Our approach reduces computational cost by up to 30x while maintaining competitive retrieval performance. By bridging late-interaction models with ANNS, FastLane enables scalable, low-latency retrieval, making it feasible for large-scale applications such as search engines, recommendation systems, and question-answering platforms. This work opens pathways for multilingual, multi-modal, and long-context retrieval, pushing the frontier of efficient and adaptive information retrieval.
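For context on the cost the abstract refers to, ColBERT-style late interaction scores a document by matching every query token against every document token and summing the per-query-token maxima (the MaxSim operator). A minimal NumPy sketch of this scoring step (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """ColBERT-style late-interaction (MaxSim) score.

    query_emb: (num_query_tokens, dim) token embeddings
    doc_emb:   (num_doc_tokens, dim) token embeddings
    Rows are assumed L2-normalized, so the dot product is cosine similarity.
    """
    # Full token-level similarity matrix: (q_tokens, d_tokens).
    # This all-pairs computation is the expense FastLane aims to reduce
    # by routing each query to fewer, more informative representations.
    sim = query_emb @ doc_emb.T
    # MaxSim reduction: best-matching document token per query token, summed.
    return float(sim.max(axis=1).sum())
```

Because the similarity matrix grows with the product of query and document token counts, pruning redundant token comparisons, as FastLane's routing does, directly shrinks the dominant term in this cost.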