Understanding Inference-Time Token Allocation and Coverage Limits in Agentic Hardware Verification

Coverage closure is the most time-consuming phase of hardware verification, and recent large language model (LLM)-based coding agents offer a promising approach to automated stimulus generation. However, prior LLM-based flows do not systematically analyze which coverage holes remain difficult to close or how inference-time computation is allocated during agentic verification. As a result, the efficiency limits and failure modes of LLM-based coverage closure remain poorly understood, particularly for large designs. We present an empirical study using a two-tier agentic framework comprising a base Codex agent and an enhanced, domain-specialized LangGraph system. The framework supports a taxonomy of coverage holes with two classes: methodology-bound ceilings (hardware tied off at integration, infeasible boundaries, dead code) and reasoning frontiers (protocol sequencing, multi-module pipeline warm-up, narrow timing conditions). Together, these classes expose fundamental limits of purely LLM-driven approaches. We further instrument the system to track token usage across six categories: system prompt, design comprehension, stimulus generation, coverage feedback, error recovery, and agentic overhead. We show that domain specialization shifts token allocation toward coverage-directed reasoning and improves efficiency. Across designs, the enhanced system achieves comparable or higher coverage (95-99%) while using 4-13x fewer tokens and converging to coverage targets 2-4x faster than a general-purpose baseline. Our results characterize the limits of LLM-based coverage closure, inform benchmark design and human escalation strategies, and guide profile-driven agent design for hardware verification.
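As a rough illustration of the six-category token accounting described above, the following minimal Python sketch tallies tokens per category and reports each category's share of the total. The names `TokenCategory` and `TokenLedger`, and the token counts in the usage lines, are hypothetical and chosen only for illustration; they are not taken from the study's instrumentation or results.

```python
from collections import Counter
from enum import Enum


class TokenCategory(Enum):
    """The six accounting categories named in the abstract."""
    SYSTEM_PROMPT = "system prompt"
    DESIGN_COMPREHENSION = "design comprehension"
    STIMULUS_GENERATION = "stimulus generation"
    COVERAGE_FEEDBACK = "coverage feedback"
    ERROR_RECOVERY = "error recovery"
    AGENTIC_OVERHEAD = "agentic overhead"


class TokenLedger:
    """Hypothetical per-run ledger: accumulates token counts by category."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, category: TokenCategory, tokens: int) -> None:
        # Reject negative counts so totals stay meaningful.
        if tokens < 0:
            raise ValueError("token count must be non-negative")
        self._counts[category] += tokens

    def total(self) -> int:
        return sum(self._counts.values())

    def shares(self) -> dict:
        # Fraction of all tokens spent in each category (0.0 if nothing recorded).
        total = self.total()
        return {
            category: (self._counts[category] / total if total else 0.0)
            for category in TokenCategory
        }


# Illustrative usage with made-up token counts (not measurements from the paper).
ledger = TokenLedger()
ledger.record(TokenCategory.SYSTEM_PROMPT, 200)
ledger.record(TokenCategory.STIMULUS_GENERATION, 600)
ledger.record(TokenCategory.COVERAGE_FEEDBACK, 200)

for category, share in ledger.shares().items():
    print(f"{category.value}: {share:.0%}")
```

A per-category profile of this kind is one way to compare how two agents distribute their inference-time budget, for example whether a larger fraction goes to coverage feedback than to error recovery.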
