Current test-time scaling (TTS) techniques enhance large language model (LLM) performance by allocating additional computation at inference time, yet they remain insufficient for agentic settings, where actions directly interact with external environments and their effects can be irreversible and costly. We propose \emph{\name}, \emph{\underline{A}gentic \underline{R}isk-Aware \underline{T}est-Time Scaling via \underline{I}terative \underline{S}imulation}, a framework that decouples exploration from commitment by enabling test-time exploration through simulated interactions prior to real-world execution. This design lets additional inference-time computation improve action-level reliability and robustness without incurring environmental risk. We further show that naive LLM-based simulators struggle to capture rare but high-impact failure modes, which substantially limits their effectiveness for agentic decision making. To address this limitation, we introduce a \emph{risk-aware tool simulator} that emphasizes fidelity on failure-inducing actions via targeted data generation and rebalanced training. Experiments on multi-turn and multi-step agentic benchmarks demonstrate that iterative simulation substantially improves agent reliability, and that risk-aware simulation is essential for consistently realizing these gains across models and tasks.