Agentic AI systems can now generate code with remarkable fluency, but a fundamental question remains: \emph{does the generated code actually do what the user intended?} The gap between informal natural language requirements and precise program behavior -- the \emph{intent gap} -- has always plagued software engineering, but AI-generated code amplifies it to an unprecedented scale. This article argues that \textbf{intent formalization} -- the translation of informal user intent into a set of checkable formal specifications -- is the key challenge that will determine whether AI makes software more reliable or merely more abundant. Intent formalization offers a tradeoff spectrum suited to the reliability needs of different contexts: from lightweight tests that disambiguate likely misinterpretations, through full functional specifications for formal verification, to domain-specific languages from which correct code is synthesized automatically. The central bottleneck is \emph{validating specifications}: since there is no oracle for specification correctness other than the user, we need semi-automated metrics that can assess specification quality with or without code, through lightweight user interaction and proxy artifacts such as tests. We survey early research that demonstrates the \emph{potential} of this approach: interactive test-driven formalization that improves program correctness, AI-generated postconditions that catch real-world bugs missed by prior methods, and end-to-end verified pipelines that produce provably correct code from informal specifications. We outline the open research challenges -- scaling beyond benchmarks, achieving compositionality under change, metrics for validating specifications, handling rich logics, designing human-AI specification interactions -- that define a research agenda spanning AI, programming languages, formal methods, and human-computer interaction.