Test-Time Compute for Frozen Embedding Models through Agentic Program Search

Test-time compute is widely believed to benefit only large reasoning models, leaving small models with nothing to gain. We argue the opposite for dense retrieval, since modern small embedding models are distilled or adapted from large language model backbones and can inherit their latent test-time-compute potential. We ask how much retrieval quality a frozen embedding model gains at inference alone, with no auxiliary model and no parameters trained at deployment. An agentic loop in which a large language model writes programs over a frozen encoder API explores 144 candidates and yields twelve Pareto-optimal programs that trade inference compute for quality across cost ratios from $c{=}1.2$ to $14.7$, every one improving nDCG@10 on all 14 discovery tasks. The programs use no trainable parameters and recover classical retrieval primitives, among them reciprocal rank fusion, the Fisher linear discriminant, Rocchio pseudo-relevance feedback, and sentence-level MaxSim. Applied unmodified to nineteen held-out tasks and three unseen encoder families, a single fixed program improves the majority of tasks, with a positive median $\Delta$nDCG@10 and a 54% to 57% win rate at $c{\ge}4$, and the gains are largest on encoder families never seen during discovery. A matched-budget learned projection head trained on the same tasks does not transfer this way, improving in-domain retrieval by $+0.20$ to $+0.25$ nDCG@10 yet falling below baseline on every held-out encoder. Small embedding models therefore inherit usable test-time-compute potential, and a frozen encoder converts inference compute into retrieval gains that transfer to new corpora and encoders with no per-domain labels.
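
To make one of the classical primitives named above concrete, the following is a minimal sketch of reciprocal rank fusion, which has no trainable parameters. It is an illustration under stated assumptions, not one of the discovered programs: the function name, the smoothing constant k = 60 (the conventional default, not a value reported here), and the toy rankings are all hypothetical. The rankings stand in for several retrieval passes made with the same frozen encoder, for example over different views of one query, which is one way extra inference compute could be spent.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]], k: int = 60
) -> List[Hashable]:
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists that contain it,
    with ranks starting at 1. k = 60 is the conventional default and is an
    assumption here, not a value taken from the paper.
    """
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; ties fall back to the id's string form
    # so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], str(d)))


# Hypothetical rankings: three passes over the same corpus, each scored with
# the same frozen encoder (e.g. the original query and two alternative views).
rankings = [
    ["d1", "d2", "d3"],
    ["d2", "d1", "d4"],
    ["d2", "d3", "d1"],
]
fused = reciprocal_rank_fusion(rankings)
print(fused)
```

In this sketch the cost grows with the number of rankings fused, since each one requires its own encoding and scoring pass, so fusing more views corresponds to a higher cost ratio $c$.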
