Soft-Prompt Tuning for Fair and Efficient LLM Benchmark Evaluation

Benchmark scores often misrepresent a large language model's (LLM's) knowledge because they also depend on factors such as the model's ability to follow specific formatting requirements. This especially penalizes base models, which may know the correct answers but lack the ability, typically introduced in post-training, to structure them as instructed. To overcome this, we propose soft-prompt tuning as an efficient, fair, and architecture-agnostic method for model evaluation. By optimizing only 10 soft-prompt vectors (roughly 0.0006% of the parameters of a 7B model) over a short tuning period, we adapt models to specific benchmark formats, closing gaps in format-following and ensuring that benchmark scores accurately reflect underlying knowledge. This makes it possible to fairly compare base models trained with different pre-training recipes on benchmarks without full post-training. We evaluate soft-prompt tuning across 7 models and 7 datasets. The results show that (a) soft-prompt tuning saturates format-following within 80 steps (~640 samples), making it highly efficient; (b) soft-prompt tuning significantly outperforms zero- and few-shot prompting, surfacing base-model knowledge that standard prompting misses; (c) even post-trained models can benefit from soft prompts to maximize format compliance; and (d) soft-prompted base-model performance predicts post-trained model rankings more reliably than zero- and few-shot baselines, offering a low-cost proxy for downstream model quality. Our contributions include (1) metrics that disentangle format-following from knowledge accuracy, (2) a fairer benchmarking protocol for LLM knowledge, and (3) a cost- and memory-efficient recipe for identifying optimal pre-training strategies early in LLM development.
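
The parameter and sample counts quoted above can be sanity-checked with a short calculation. The sketch below is illustrative only and rests on assumptions not stated in the abstract: a hidden size of 4096 (typical of many 7B models), a parameter count of exactly 7 billion, and a batch size of 8 (inferred from 80 steps covering roughly 640 samples). All names are hypothetical.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
# ASSUMPTIONS (not from the source): hidden size 4096, exactly 7e9 model
# parameters, and a batch size of 8 samples per tuning step.

NUM_SOFT_PROMPT_VECTORS = 10      # stated in the abstract
HIDDEN_SIZE = 4096                # assumption: typical 7B-model embedding width
MODEL_PARAMS = 7_000_000_000      # assumption: "7B" taken as exactly 7e9
TUNING_STEPS = 80                 # stated in the abstract
BATCH_SIZE = 8                    # assumption: 640 samples / 80 steps


def trainable_fraction_percent(num_vectors: int, hidden_size: int, model_params: int) -> float:
    """Return the trainable soft-prompt parameters as a percentage of the model size."""
    return 100.0 * (num_vectors * hidden_size) / model_params


def samples_seen(steps: int, batch_size: int) -> int:
    """Return the number of training samples consumed after the given number of steps."""
    return steps * batch_size


# Each soft-prompt vector has one entry per embedding dimension.
trainable_params = NUM_SOFT_PROMPT_VECTORS * HIDDEN_SIZE
percent = trainable_fraction_percent(NUM_SOFT_PROMPT_VECTORS, HIDDEN_SIZE, MODEL_PARAMS)
total_samples = samples_seen(TUNING_STEPS, BATCH_SIZE)

print(f"trainable parameters: {trainable_params}")
print(f"fraction of model:    {percent:.6f}%")
print(f"samples seen:         {total_samples}")
```

Under these assumptions the soft prompt has 40,960 trainable values, which is consistent with the "roughly 0.0006%" figure, and 80 steps cover 640 samples. A different hidden size or batch size would change the numbers accordingly.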
