Fixed-cardinality retrieval passes the same number K of top-ranked chunks to the generator regardless of query complexity, which over-retrieves for narrow queries and under-retrieves for compositional ones. We describe ScoreGate, a lightweight score-space decision mechanism that controls retrieval cardinality at inference time using two scores the standard pipeline already produces: the bi-encoder similarity s_i and the cross-encoder reranker score r_i. It requires no additional model inference calls. Its core insight is that cross-encoder affirmation can rescue semantically relevant chunks that the bi-encoder ranks poorly because of vocabulary mismatch, a failure mode that neither fixed-K retrieval nor single-score thresholding addresses. On MS MARCO (200 dev queries), ScoreGate achieves MRR@10 = 0.401 while retaining 35% fewer chunks than Standard Top-K. On an internal benchmark (n=300, Fleiss' kappa=0.87), we observed zero false positives for ScoreGate (95% CI [96.4%, 100%]) at 97.77-99.34% recall, with 34.8% fewer tokens per query and only 31 ms of added latency. Results on both MS MARCO and real-world production traffic suggest that adaptive retrieval cardinality can improve retrieval efficiency without degrading retrieval quality.
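To make the score-space decision concrete, the following is a minimal sketch of one possible gating rule over s_i and r_i. The abstract does not give the exact ScoreGate rule, so everything here is an assumption made for illustration: the `Chunk` type, the `score_gate` function, the thresholds `tau_s`, `tau_r`, and `tau_rescue`, the cap `max_k`, and the premise that both scores are normalized to [0, 1]. A chunk is kept when both scores agree it is relevant, or when the reranker score alone is high enough to "rescue" a chunk the bi-encoder ranked poorly.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    text: str
    s: float  # bi-encoder similarity s_i (assumed normalized to [0, 1])
    r: float  # cross-encoder reranker score r_i (assumed normalized to [0, 1])


def score_gate(
    chunks: List[Chunk],
    tau_s: float = 0.5,       # hypothetical bi-encoder threshold
    tau_r: float = 0.5,       # hypothetical reranker threshold
    tau_rescue: float = 0.9,  # hypothetical reranker-only rescue threshold
    max_k: int = 10,          # hypothetical upper bound on retained chunks
) -> List[Chunk]:
    """Return a variable-size subset of chunks, ordered by reranker score."""
    kept = []
    for c in chunks:
        both_agree = c.s >= tau_s and c.r >= tau_r
        rescued = c.r >= tau_rescue  # strong reranker score overrides a low s_i
        if both_agree or rescued:
            kept.append(c)
    kept.sort(key=lambda c: c.r, reverse=True)
    return kept[:max_k]


candidates = [
    Chunk("a", s=0.82, r=0.91),  # both scores high: kept
    Chunk("b", s=0.31, r=0.95),  # low s_i, very high r_i: rescued
    Chunk("c", s=0.77, r=0.20),  # high s_i, reranker rejects: dropped
    Chunk("d", s=0.40, r=0.60),  # low s_i, r_i below rescue threshold: dropped
    Chunk("e", s=0.60, r=0.55),  # both scores above threshold: kept
]
print([c.text for c in score_gate(candidates)])
```

In this sketch the number of retained chunks depends on the scores rather than on a fixed K, and chunk "b" illustrates the vocabulary-mismatch case: a single threshold on s_i would discard it. No extra model call is made, since both scores are taken as inputs already produced by retrieval and reranking.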