Current test-time scaling (TTS) techniques enhance large language model (LLM) performance by allocating additional computation at inference time, yet they remain insufficient for agentic settings, where actions directly interact with external environments and their effects can be irreversible and costly. We propose ARTIS, Agentic Risk-Aware Test-Time Scaling via Iterative Simulation, a framework that decouples exploration from commitment by enabling test-time exploration through simulated interactions prior to real-world execution. This design allows extending inference-time computation to improve action-level reliability and robustness without incurring environmental risk. We further show that naive LLM-based simulators struggle to capture rare but high-impact failure modes, substantially limiting their effectiveness for agentic decision making. To address this limitation, we introduce a risk-aware tool simulator that emphasizes fidelity on failure-inducing actions via targeted data generation and rebalanced training. Experiments on multi-turn and multi-step agentic benchmarks demonstrate that iterative simulation substantially improves agent reliability, and that risk-aware simulation is essential for consistently realizing these gains across models and tasks.