Verify Before You Fix: Agentic Execution Grounding for Trustworthy Cross-Language Code Analysis

Learned classifiers deployed in agentic pipelines face a fundamental reliability problem: their predictions are probabilistic inferences, not verified conclusions, and acting on them without grounding in observable evidence leads to compounding failures across downstream stages. Software vulnerability analysis makes this cost concrete and measurable. We address the problem with a unified cross-language vulnerability lifecycle framework built around three LLM-driven reasoning stages (hybrid structural-semantic detection, execution-grounded agentic validation, and validation-aware iterative repair), all governed by a strict invariant: no repair action is taken without execution-based confirmation of exploitability. Cross-language generalization is achieved via a Universal Abstract Syntax Tree (uAST) that normalizes Java, Python, and C++ into a shared structural schema, combined with a hybrid fusion of GraphSAGE and Qwen2.5-Coder-1.5B embeddings through learned two-way gating, whose per-sample weights provide intrinsic explainability at no additional cost. The framework achieves 89.84% to 92.02% intra-language detection accuracy and 74.43% to 80.12% zero-shot cross-language F1, and resolves 69.74% of vulnerabilities end-to-end at a 12.27% total failure rate. Ablations establish the necessity of both components: removing uAST degrades cross-language F1 by 23.42%, while disabling validation increases unnecessary repairs by 131.7%. These results demonstrate that execution-grounded closed-loop reasoning is a principled and practically deployable mechanism for trustworthy LLM-driven agentic AI.
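
To make the learned two-way gating concrete, the following is a minimal sketch of how a per-sample gate could weight a structural embedding against a semantic one. It is not the framework's implementation. The function name gated_fusion, the parameters w_struct, w_sem, and bias, the softmax form of the gate, and the assumption that both embeddings have already been projected to the same dimension are all illustrative choices that the abstract does not specify.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def softmax2(a: float, b: float) -> Tuple[float, float]:
    """Numerically stable softmax over exactly two logits."""
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    total = ea + eb
    return ea / total, eb / total


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(u, v))


def gated_fusion(
    h_struct: Sequence[float],   # structural embedding (e.g. from a graph encoder)
    h_sem: Sequence[float],      # semantic embedding (e.g. from a code LLM)
    w_struct: Sequence[float],   # hypothetical gate weights, length 2 * dim
    w_sem: Sequence[float],      # hypothetical gate weights, length 2 * dim
    bias: Tuple[float, float] = (0.0, 0.0),
) -> Tuple[Vector, Tuple[float, float]]:
    """Fuse two same-dimension embeddings with a per-sample two-way gate.

    Assumption: the gate logits are linear in the concatenated embeddings,
    and the two gate weights sum to 1 via a softmax.
    """
    if len(h_struct) != len(h_sem):
        raise ValueError("embeddings must share a dimension before fusion")
    joint = list(h_struct) + list(h_sem)
    logit_struct = dot(w_struct, joint) + bias[0]
    logit_sem = dot(w_sem, joint) + bias[1]
    g_struct, g_sem = softmax2(logit_struct, logit_sem)
    # Convex combination of the two views, weighted per sample.
    fused = [g_struct * s + g_sem * t for s, t in zip(h_struct, h_sem)]
    return fused, (g_struct, g_sem)


if __name__ == "__main__":
    fused, gates = gated_fusion([1.0, 0.0], [0.0, 1.0], [0.0] * 4, [0.0] * 4)
    print(fused, gates)
```

Because the gate weights are produced for every input, they can be read directly as an indication of how much a given prediction relied on structure versus semantics, which is the sense in which such a gate offers explainability without a separate attribution step.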
