Adaptive prompt and program search makes LLM evaluation selection-sensitive. Once benchmark items are reused inside tuning, the observed winner's score need not estimate the fresh-data performance of the full tune-then-deploy procedure. We study inference for this procedure-level target under explicit tuning budgets. We propose SIREN, a selection-aware repeated-split reporting protocol that freezes the post-search shortlist, separates splitwise selection from held-out evaluation, and uses an item-level Gaussian multiplier bootstrap for uncertainty quantification. In a fixed-shortlist regime with smooth stabilized selection, the estimator admits a first-order item-level representation, and the bootstrap yields valid simultaneous inference on a finite budget grid. This supports confidence intervals for procedure-performance curves and pre-specified equal-budget and cross-budget comparisons. Controlled simulations and MMLU-Pro tuning experiments show that winner-based reporting can be optimistic and can change deployment conclusions, while SIREN remains close to the finite-sample reporting target.
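To make the reporting protocol concrete, the following is a minimal sketch of repeated-split selection with held-out evaluation and an item-level Gaussian multiplier bootstrap over a budget grid. It is an illustration under stated assumptions, not the authors' implementation: the function names `item_contributions` and `simultaneous_band` are hypothetical, a softmax over selection-half means stands in for the "smooth stabilized selection" rule, a budget `b` is assumed to mean that only the first `b` shortlisted candidates are eligible, and the item-level contribution used here keeps only the held-out evaluation term of whatever first-order representation the paper derives.

```python
import math
import random


def item_contributions(scores, budgets, n_splits=50, temp=0.05, seed=0):
    """Per-budget, per-item held-out contributions from repeated splits.

    scores[c][i] is the score of frozen-shortlist candidate c on item i.
    Assumed convention: budget b makes only candidates 0..b-1 eligible.
    """
    rng = random.Random(seed)
    n = len(scores[0])
    sums = [[0.0] * n for _ in budgets]   # accumulated held-out score per item
    counts = [0] * n                      # times each item was held out
    for _ in range(n_splits):
        perm = list(range(n))
        rng.shuffle(perm)
        sel, ev = perm[: n // 2], perm[n // 2:]
        for i in ev:
            counts[i] += 1
        for k, b in enumerate(budgets):
            # Selection uses only the selection half of this split.
            means = [sum(scores[c][i] for i in sel) / len(sel) for c in range(b)]
            top = max(means)
            w = [math.exp((m - top) / temp) for m in means]  # softmax (assumed)
            z = sum(w)
            # Evaluation uses only the held-out half.
            for i in ev:
                sums[k][i] += sum(w[c] * scores[c][i] for c in range(b)) / z
    kept = [i for i in range(n) if counts[i] > 0]
    return [[sums[k][i] / counts[i] for i in kept] for k in range(len(budgets))]


def simultaneous_band(psi, level=0.95, n_boot=2000, seed=1):
    """Sup-t band over the budget grid via Gaussian multipliers on items."""
    rng = random.Random(seed)
    n = len(psi[0])
    est = [sum(p) / n for p in psi]
    centered = [[v - e for v in p] for p, e in zip(psi, est)]
    sd = [math.sqrt(sum(v * v for v in c) / n) or 1e-12 for c in centered]
    sups = []
    for _ in range(n_boot):
        xi = [rng.gauss(0.0, 1.0) for _ in range(n)]  # one multiplier per item
        sups.append(max(
            abs(sum(x * v for x, v in zip(xi, c))) / (math.sqrt(n) * s)
            for c, s in zip(centered, sd)
        ))
    sups.sort()
    crit = sups[min(n_boot - 1, math.ceil(level * n_boot) - 1)]
    half = [crit * s / math.sqrt(n) for s in sd]
    return [(e - h, e, e + h) for e, h in zip(est, half)]


# Synthetic data only: four candidates with similar accuracies on 200 items.
data_rng = random.Random(42)
true_acc = [0.60, 0.62, 0.58, 0.61]
scores = [[1.0 if data_rng.random() < p else 0.0 for _ in range(200)]
          for p in true_acc]
budgets = [1, 2, 4]

psi = item_contributions(scores, budgets)
band = simultaneous_band(psi)
naive_winner = max(sum(s) / len(s) for s in scores)  # winner-based report

for b, (lo, est, hi) in zip(budgets, band):
    print(f"budget={b}: {est:.3f} [{lo:.3f}, {hi:.3f}]")
print(f"naive winner score: {naive_winner:.3f}")
```

The same multipliers are shared across all budgets within a bootstrap draw, which is what lets a single critical value cover the whole grid. The naive winner score is printed only for contrast with the split-based estimates; the synthetic accuracies are placeholders and say nothing about the paper's results.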