Advanced LLMs have achieved near-ceiling instruction-following accuracy on benchmarks such as IFEval. However, these impressive scores do not necessarily translate into reliable service in real-world use, where users vary their phrasing, contextual framing, and task formulations. In this paper, we study nuance-oriented reliability: whether models exhibit consistent competence across cousin prompts that convey analogous user intents with subtle nuances. To quantify this, we introduce a new metric, reliable@k, and develop an automated pipeline that generates high-quality cousin prompts via data augmentation. Building on these, we construct IFEval++ for systematic evaluation. Across 20 proprietary and 26 open-source LLMs, we find that current models fall substantially short on nuance-oriented reliability: their performance can drop by up to 61.8% under nuanced prompt modifications. We further characterize this failure mode and explore three potential improvement recipes. Our findings highlight nuance-oriented reliability as a crucial yet underexplored next step toward more dependable and trustworthy LLM behavior. Our code and benchmark are available at https://github.com/jianshuod/IFEval-pp.
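The abstract does not spell out how reliable@k is computed; a plausible reading, by analogy with pass@k but demanding consistency rather than a single success, is the fraction of user intents for which a model satisfies all k cousin prompts. The sketch below is a minimal illustration under that assumption; the function name `reliable_at_k` and the data layout are hypothetical, not taken from the paper.

```python
from typing import Dict, List

def reliable_at_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of intents whose first k cousin prompts ALL succeed.

    `results` maps each user intent to per-cousin-prompt pass/fail
    outcomes (True = instruction followed). This definition is an
    assumption, analogous to pass@k but requiring consistency.
    """
    trimmed = [flags[:k] for flags in results.values() if len(flags) >= k]
    if not trimmed:
        raise ValueError("no intent has at least k cousin prompts")
    return sum(all(flags) for flags in trimmed) / len(trimmed)

# Illustrative usage: two intents, each probed with three cousin prompts.
outcomes = {
    "format-as-json": [True, True, False],
    "limit-to-100-words": [True, True, True],
}
print(reliable_at_k(outcomes, k=3))  # 0.5, since only one intent passes all three
```

Under this reading, a model can score highly on standard per-prompt accuracy while scoring low on reliable@k, which is exactly the gap the benchmark is meant to expose.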