Binary analysis increasingly relies on large language models (LLMs) to perform semantic reasoning over complex program behaviors. However, existing approaches largely adopt a one-pass execution paradigm, where reasoning operates over a fixed program representation constructed by static analysis tools. This formulation limits the ability to adapt exploration based on intermediate results and makes it difficult to sustain long-horizon, multi-path analysis under constrained context. We present FORGE, a system that rethinks LLM-based analysis as a feedback-driven execution process. FORGE interleaves reasoning and tool interaction through a reasoning-action-observation loop, enabling incremental exploration and evidence construction. To address the instability of long-horizon reasoning, we introduce a Dynamic Forest of Agents (FoA), a decomposed execution model that dynamically coordinates parallel exploration while bounding per-agent context. We evaluate FORGE on 3,457 real-world firmware binaries. FORGE identifies 1,274 vulnerabilities across 591 unique binaries, achieving 72.3% precision while covering a broader range of vulnerability types than prior approaches. These results demonstrate that structuring LLM-based analysis as a decomposed, feedback-driven execution system enables both scalable reasoning and high-quality outcomes in long-horizon tasks.
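To make the reasoning-action-observation loop and the bounded per-agent context more concrete, the following is a minimal sketch under stated assumptions, not FORGE's implementation. The toy call graph, the `list_callees` tool, the `RISKY_SINKS` set, and the `Agent` and `explore` names are all hypothetical. A fixed rule stands in for the LLM's reasoning step, and agents run one after another rather than in parallel.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical toy "binary": function name -> functions it calls.
CALL_GRAPH = {
    "main": ["parse_request", "log_init"],
    "parse_request": ["copy_header"],
    "copy_header": ["strcpy"],
    "log_init": [],
    "strcpy": [],
}
RISKY_SINKS = {"strcpy"}  # assumption: sinks we treat as evidence worth reporting


def list_callees(func: str) -> List[str]:
    """Stand-in tool; a real system would query a disassembler or decompiler."""
    return CALL_GRAPH.get(func, [])


@dataclass
class Agent:
    target: str
    path: Tuple[str, ...]
    max_steps: int = 8      # bound on loop iterations per agent
    max_context: int = 4    # bound on remembered observations per agent
    context: deque = field(default_factory=deque, init=False)

    def __post_init__(self) -> None:
        # Old observations fall off once the per-agent budget is reached.
        self.context = deque(maxlen=self.max_context)

    def run(self):
        findings, children = [], []
        current = self.target
        for _ in range(self.max_steps):
            observation = list_callees(current)          # action
            self.context.append((current, observation))  # observation
            # "Reasoning": a fixed rule in place of an LLM decision.
            findings += [self.path + (c,) for c in observation if c in RISKY_SINKS]
            todo = [c for c in observation
                    if c not in RISKY_SINKS and c not in self.path]
            if not todo:
                break
            # Follow the first branch; delegate the others to child agents.
            for c in todo[1:]:
                children.append(
                    Agent(c, self.path + (c,), self.max_steps, self.max_context))
            current = todo[0]
            self.path = self.path + (current,)
        else:
            # Step budget exhausted: hand the unexplored function to a fresh agent.
            children.append(
                Agent(current, self.path, self.max_steps, self.max_context))
        return findings, children


def explore(root: str, max_steps: int = 8, max_agents: int = 32):
    """Run agents from a queue until no work remains or the agent cap is hit."""
    frontier = deque([Agent(root, (root,), max_steps)])
    findings, started = [], 0
    while frontier and started < max_agents:
        agent = frontier.popleft()
        started += 1
        found, children = agent.run()
        findings.extend(found)
        frontier.extend(children)
    return findings


if __name__ == "__main__":
    for call_path in explore("main"):
        print(" -> ".join(call_path))
```

Calling `explore("main", max_steps=1)` forces each agent to hand off after a single step, which shows how the same path can still be recovered when the per-agent budget is very small. In this sketch the queue plays the role of the coordinator; a real system could dispatch the queued agents concurrently.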