While Late Interaction models exhibit strong retrieval performance, many of their underlying dynamics remain understudied, potentially hiding performance bottlenecks. In this work, we focus on two topics in Late Interaction retrieval: a length bias that arises when using multi-vector scoring, and the similarity distribution beyond the best scores pooled by the MaxSim operator. We analyze these behaviors for state-of-the-art models on the NanoBEIR benchmark. Results show that while the theoretical length bias of causal Late Interaction models holds in practice, bi-directional models can also suffer from it in extreme cases. We also note that no significant similarity trend lies beyond the top-1 document token, validating that the MaxSim operator efficiently exploits the token-level similarity scores.
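For readers unfamiliar with the operator, the following is a minimal NumPy sketch of MaxSim scoring as it is commonly defined for Late Interaction models: each query token keeps only its best-matching document token, and these maxima are summed. The function names, embedding shapes, and random embeddings are assumptions made for illustration and are not taken from the models or data studied here. The sketch also shows one way a length bias can arise: if the embeddings of existing document tokens stay fixed when more tokens are appended (as one would expect for a causal encoder, where earlier tokens do not attend to later ones), the score can only stay the same or increase.

```python
import numpy as np


def normalize(x):
    """L2-normalize each row so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def maxsim_score(query_emb, doc_emb):
    """MaxSim: sum over query tokens of the max similarity to any document token.

    query_emb: (n_query_tokens, dim), doc_emb: (n_doc_tokens, dim).
    """
    sim = query_emb @ doc_emb.T          # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())  # keep the top-1 document token per query token


# Hypothetical random embeddings standing in for real model outputs.
rng = np.random.default_rng(0)
query = normalize(rng.standard_normal((4, 8)))
doc = normalize(rng.standard_normal((16, 8)))
extra_tokens = normalize(rng.standard_normal((16, 8)))

# Assumption: appending tokens leaves the original token embeddings unchanged.
longer_doc = np.vstack([doc, extra_tokens])

short_score = maxsim_score(query, doc)
long_score = maxsim_score(query, longer_doc)
print(f"short document: {short_score:.4f}, longer document: {long_score:.4f}")
```

Because a maximum over a superset of tokens is never smaller than the maximum over the original tokens, `long_score` is at least `short_score` under the fixed-embedding assumption. For bi-directional encoders that assumption does not hold, since appending tokens changes every token embedding.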