Type-Error Ablation and AI Coding Agents

Programming language implementors have traditionally designed error messages with one consumer in mind: the human programmer. Human-factors research has consistently found that programmers engage with error messages poorly -- they skim, miss key information, and are easily overwhelmed. The practical consequence has been a strong design pressure toward brevity: messages should be terse enough that programmers will actually read them. AI coding agents are now a second, fundamentally different consumer of error messages. Unlike humans, agents do not tire, lose attention, or find length cognitively overwhelming. This raises a question the programming-language community has not previously had reason to ask: should error-message detail be calibrated differently for AI agents than for humans? We investigate this question through a controlled experiment using Shplait, an ML-style statically typed language. We construct a suite of programs, each containing a single deliberate type error, and measure how often an AI agent repairs them under four ablation conditions: a detailed error context using the unification stack; a proximate error location; a minimal type error; and a dynamic (test suite) error only. An automated oracle uses a test suite to classify each repair attempt as a type error, semantically incorrect, or semantically correct. We find concrete evidence that more detailed error messages improve an agent's ability to fix type errors. We also find that the presence of a type system appears to help more than test-suite failure reports alone. As a secondary finding, in cases where an agent successfully fixes the type error, the resulting program passes all semantic tests most of the time -- lending empirical support to a widely held folk belief about typed languages. We also see evidence that leading agents are able to correctly reconstruct the meaning of programs in which all names have been obfuscated.

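The three-way classification performed by the oracle can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation: the names `Outcome`, `classify_repair`, `toy_typechecks`, and `toy_tests` are hypothetical, and the toy checker and test are stand-ins for the Shplait type checker and the real test suite.

```python
from enum import Enum
from typing import Callable, Iterable


class Outcome(Enum):
    """The three oracle verdicts named in the abstract."""
    TYPE_ERROR = "type error"
    SEMANTICALLY_INCORRECT = "semantically incorrect"
    SEMANTICALLY_CORRECT = "semantically correct"


def classify_repair(
    program: str,
    typechecks: Callable[[str], bool],
    tests: Iterable[Callable[[str], bool]],
) -> Outcome:
    """Classify one repair attempt (hypothetical oracle sketch).

    `typechecks` and `tests` are assumed callbacks; a real oracle
    would invoke the type checker and run the test suite instead.
    """
    if not typechecks(program):
        return Outcome.TYPE_ERROR
    if all(test(program) for test in tests):
        return Outcome.SEMANTICALLY_CORRECT
    return Outcome.SEMANTICALLY_INCORRECT


# Toy stand-ins (assumptions for illustration only).
def toy_typechecks(program: str) -> bool:
    return "ill-typed" not in program


toy_tests = [lambda program: "wrong-answer" not in program]

attempts = [
    "ill-typed repair",
    "well-typed wrong-answer repair",
    "well-typed repair",
]
outcomes = [classify_repair(p, toy_typechecks, toy_tests) for p in attempts]

# Among attempts that type-check, what fraction passes every test?
well_typed = [o for o in outcomes if o is not Outcome.TYPE_ERROR]
correct_given_typed = (
    sum(o is Outcome.SEMANTICALLY_CORRECT for o in well_typed) / len(well_typed)
)
```

The last quantity corresponds to the secondary finding: it is the rate at which repairs that eliminate the type error also pass all semantic tests. Checking types before running tests makes the three verdicts mutually exclusive, so each attempt is counted once.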