Recent LLMs have shown remarkable success in following user instructions, yet handling instructions with multiple constraints remains a significant challenge. In this work, we introduce WildIFEval, a large-scale dataset of 7K real user instructions with diverse, multi-constraint conditions. Unlike prior datasets, our collection spans a broad lexical and topical spectrum of constraints, extracted from natural user instructions. We categorize these constraints into eight high-level classes to capture their distribution and dynamics in real-world scenarios. Leveraging WildIFEval, we conduct extensive experiments to benchmark the instruction-following capabilities of leading LLMs. WildIFEval clearly differentiates between small and large models and demonstrates that all models have substantial room for improvement on such tasks. We analyze the effects of the number and type of constraints on performance, revealing patterns in how models follow constraints. We release our dataset to promote further research on instruction-following under complex, realistic conditions.