Large Language Models (LLMs) have catalyzed vibe coding, in which users rely on LLMs to generate and iteratively refine code through natural language interaction until it passes their vibe check. The vibe check reflects human preference and goes beyond functionality: a solution should feel right, read cleanly, preserve intent, and remain correct. However, current code evaluation remains anchored to pass@k, which captures only functional correctness and overlooks the non-functional instructions that users routinely apply. In this paper, we hypothesize that, alongside functional correctness, instruction following is the missing piece of the vibe check. To quantify models' code instruction-following capabilities with measurable signals, we present VeriCode, a taxonomy of 30 verifiable code instructions together with deterministic verifiers. We use the taxonomy to augment established evaluation suites, resulting in SWE-IF, a testbed that assesses both instruction following and functional correctness. Evaluating 31 LLMs, we show that even the strongest models struggle to comply with multiple instructions and exhibit functional regression. Most importantly, a composite score of functional correctness and instruction following correlates best with human preference, with instruction following emerging as the primary differentiator among LLMs. Our code, data, and taxonomy are available at https://github.com/maszhongming/SWE-IF.
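To make the notion of a verifiable instruction with a deterministic verifier concrete, the following is a minimal sketch under stated assumptions. The two instructions (a line-length limit and a docstring requirement), the function names, and the `follow_rate` aggregate are hypothetical illustrations; they are not taken from the VeriCode taxonomy or from the SWE-IF scoring.

```python
import ast
from typing import Callable

# Hypothetical instructions for illustration only; they are not claimed
# to be part of the VeriCode taxonomy.


def max_line_length(code: str, limit: int = 79) -> bool:
    """Return True if no line in `code` is longer than `limit` characters."""
    return all(len(line) <= limit for line in code.splitlines())


def all_functions_have_docstrings(code: str) -> bool:
    """Return True if every function definition in `code` has a docstring."""
    tree = ast.parse(code)
    funcs = [
        node for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    return all(ast.get_docstring(f) is not None for f in funcs)


def follow_rate(code: str, verifiers: list[Callable[[str], bool]]) -> float:
    """Fraction of verifiers that `code` passes (an assumed aggregate)."""
    results = [verify(code) for verify in verifiers]
    return sum(results) / len(results)


# A generated solution that is functionally fine but ignores one instruction.
sample = "def add(a, b):\n    return a + b\n"
checks = [max_line_length, all_functions_have_docstrings]
rate = follow_rate(sample, checks)
```

Each check is a pure function of the generated code, so repeated runs give the same verdict. In this sketch, `sample` satisfies the line-length instruction but not the docstring instruction, which is the kind of non-functional miss that pass@k alone would not surface.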