Performance of large language models (LLMs) on multiple-choice tasks differs markedly between symbol-based and cloze-style evaluation formats. These discrepancies are systematically attributable to task characteristics: natural-language continuation benefits from likelihood scoring, whereas explicit comparison among options is better served by symbol-based selection. The trends hold across a range of decoder-based LLMs, indicating that the effect is model-agnostic. To resolve these inconsistencies, a dynamic format-alignment strategy is introduced: a lightweight classifier, trained on latent model-preference signals, selects the evaluation format for each problem instance. Unlike human-designed heuristics, which often degrade performance, this approach relies on model-generated signals to determine the more suitable format per instance. The method yields substantial and consistent zero-shot accuracy gains across reasoning and knowledge benchmarks, better revealing the models' latent capabilities.
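To make the two evaluation formats concrete, the following is a minimal sketch (Python with Hugging Face transformers), not the paper's code: the model name "gpt2" and all helper names are illustrative. Cloze-style scoring ranks each option's text by its likelihood as a continuation of the question, while symbol-based selection presents labeled options and compares the likelihood of the label symbols.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is a stand-in; any decoder-only causal LM scores the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def continuation_logprob(prompt: str, continuation: str) -> float:
    """Summed log-probability of `continuation` tokens given `prompt`.
    Assumes tokenizing prompt + continuation preserves the prompt's token
    boundary, which holds for typical whitespace-separated text."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    log_probs = torch.log_softmax(model(ids).logits[0, :-1], dim=-1)
    token_lps = log_probs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
    return token_lps[prompt_len - 1:].sum().item()

def cloze_scores(question: str, options: list[str]) -> list[float]:
    """Cloze format: score each option's text as a continuation of the stem.
    (Length-normalizing these scores is a common variant.)"""
    return [continuation_logprob(question + " ", opt) for opt in options]

def symbol_scores(question: str, options: list[str]) -> list[float]:
    """Symbol format: list labeled options, then compare label likelihoods."""
    labels = "ABCD"[: len(options)]
    prompt = (question + "\n"
              + "\n".join(f"{l}. {o}" for l, o in zip(labels, options))
              + "\nAnswer:")
    return [continuation_logprob(prompt, " " + l) for l in labels]

def predict(scores: list[float]) -> int:
    """Index of the highest-scoring option."""
    return max(range(len(scores)), key=scores.__getitem__)
```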
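A hedged sketch of the dynamic format-alignment idea follows, building on the helpers above. The per-instance preference label is a model-generated signal, here proxied by the softmax margin between the model's top two option scores under each format; this proxy and the TF-IDF plus logistic-regression classifier are assumptions for illustration, not the paper's exact recipe.

```python
import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def score_margin(scores: list[float]) -> float:
    """Decisiveness proxy: softmax gap between the top two option scores."""
    probs = torch.tensor(scores).softmax(dim=0).sort(descending=True).values
    return (probs[0] - probs[1]).item()

def format_preference(question: str, options: list[str]) -> int:
    """Model-generated signal: 1 if the model is more decisive under the
    symbol format than under the cloze format, else 0."""
    return int(score_margin(symbol_scores(question, options))
               > score_margin(cloze_scores(question, options)))

def train_format_selector(questions: list[str], option_sets: list[list[str]]):
    """Fit a lightweight classifier mapping a question to its preferred format.
    (Assumes the training pool elicits both labels at least once.)"""
    labels = [format_preference(q, o) for q, o in zip(questions, option_sets)]
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(questions, labels)
    return clf

def aligned_predict(clf, question: str, options: list[str]) -> int:
    """Route the instance to whichever format the classifier selects."""
    scorer = symbol_scores if clf.predict([question])[0] else cloze_scores
    return predict(scorer(question, options))
```

Under this reading, each test instance is scored under only the routed format, so the overhead beyond standard evaluation is a single classifier call per instance; the expensive dual-format scoring is confined to the training pool that supplies the preference labels.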