Large language model retrievers improve performance on complex queries, but their practical value depends on efficiency, robustness, and reliable confidence signals in addition to accuracy. We reproduce BRIGHT, a reasoning-intensive retrieval benchmark, across 12 tasks and 14 retrievers, and extend the evaluation with cold-start indexing cost, query latency distributions and throughput, corpus scaling, robustness to controlled query perturbations, and the usefulness of retrieval scores as confidence signals, measured by AUROC for predicting query success. We also quantify \emph{reasoning overhead} by comparing standard queries to five provided reasoning-augmented variants, measuring accuracy gains relative to added latency. We find that some reasoning-specialized retrievers achieve strong effectiveness while remaining competitive in throughput, whereas several large LLM-based bi-encoders incur substantial latency for modest gains. Reasoning augmentation adds minimal latency for sub-1B encoders, but it shows diminishing returns for top retrievers and may reduce performance on formal math and code domains. Confidence calibration is consistently weak across model families, indicating that raw retrieval scores are unreliable for downstream routing without additional calibration. We release all code and artifacts for reproducibility.
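As an illustration of the confidence evaluation mentioned above, the following minimal sketch computes AUROC for predicting query success from a per-query retrieval score. It is not the released implementation: the function name `auroc`, the use of the top-1 score as the confidence signal, the binary success labels, and the example values are all assumptions made for illustration.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise AUROC: the probability that a randomly chosen successful
    query (label 1) has a higher confidence score than a randomly chosen
    failed query (label 0). Tied scores count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one success and one failure")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical data: one confidence score per query (assumed here to be the
# top-1 retrieval score) and whether the query counted as a success.
top1_scores = [0.91, 0.42, 0.77, 0.58, 0.35, 0.80]
success = [1, 0, 1, 0, 1, 0]

print(f"AUROC = {auroc(top1_scores, success):.3f}")
```

A value near 0.5 means the raw score separates successful from failed queries no better than chance, which is the sense in which a confidence signal would be unreliable for routing.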