Coding agents fail when text-level guesses outrun program facts: they hallucinate APIs, drift to the wrong symbol, and apply edits without evidence that the workspace remains valid. Compilers, type checkers, and language servers already compute the missing supervision signal, in the form of diagnostics, symbol resolution, type information, references, and refactoring preconditions, but expose it through interfaces designed for human-driven IDEs rather than learning loops. We introduce Reinforcement Learning from Compiler and Language Server Feedback (RLCSF) together with Lanser-CLI, a CLI-first orchestration layer that exposes this signal to agents and CI. RLCSF treats each tool interaction as a transition and computes a shaped process reward from deterministic changes in diagnostics, selector confidence, and edit safety. Lanser-CLI, in turn, converts ephemeral LSP sessions into replayable Analysis Bundles with pinned environment metadata and stable content hashes. Its core mechanisms are robust selectors that go beyond file:line:col, deterministic bundle normalization, preview-first guarded mutations, and a reward functional whose potential-based component is replayable under frozen snapshots. We formalize determinism for canonical bundles and prove that componentwise-improving transitions receive non-negative reward in the undiscounted setting. Together, these pieces yield a practical substrate for process supervision of coding agents.
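To make the non-negativity claim concrete, the following is a minimal sketch of a potential-based shaped reward over the three signals the abstract names (diagnostics, selector confidence, edit safety). The abstract does not give the exact reward functional, so the `Snapshot` fields, the weights, and the linear form of `potential` are illustrative assumptions, not the definition used by RLCSF or Lanser-CLI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical summary of a frozen workspace state (field names are assumptions)."""
    error_count: int            # number of error-level diagnostics
    selector_confidence: float  # assumed to lie in [0, 1]
    edit_safety: float          # assumed to lie in [0, 1]


# Illustrative non-negative weights; the real values are not given in the source.
W_DIAG, W_SEL, W_SAFE = 1.0, 0.5, 0.5


def potential(s: Snapshot) -> float:
    """Assumed linear potential: fewer errors, higher confidence, and safer edits score higher."""
    return (-W_DIAG * s.error_count
            + W_SEL * s.selector_confidence
            + W_SAFE * s.edit_safety)


def shaped_reward(before: Snapshot, after: Snapshot, gamma: float = 1.0) -> float:
    """Potential-based shaping term: gamma * Phi(after) - Phi(before)."""
    return gamma * potential(after) - potential(before)


def improves_componentwise(before: Snapshot, after: Snapshot) -> bool:
    """True when no component gets worse across the transition."""
    return (after.error_count <= before.error_count
            and after.selector_confidence >= before.selector_confidence
            and after.edit_safety >= before.edit_safety)


# A transition that fixes two errors and raises selector confidence.
before = Snapshot(error_count=3, selector_confidence=0.5, edit_safety=0.5)
after = Snapshot(error_count=1, selector_confidence=0.75, edit_safety=0.5)
reward = shaped_reward(before, after)  # undiscounted: gamma = 1.0
```

Under these assumptions, every weight is non-negative and each term of the potential is monotone in its component, so a componentwise-improving transition cannot lower the potential and the undiscounted reward is non-negative. With `gamma < 1` the guarantee can fail: a transition that leaves a state with positive potential unchanged receives a negative shaped reward, which is consistent with the abstract restricting the result to the undiscounted setting. Because `potential` depends only on the snapshot, the same pair of frozen snapshots always replays to the same reward.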