We introduce HERO'S JOURNEY, a benchmark for rule induction in goal-directed episodic tasks, in which agents must infer hidden rules from demonstrations and then apply them through multi-step execution. HERO'S JOURNEY comprises eight tasks spanning two families, attribute induction and procedural induction; each task comes with four structural rule forms, controllable lexical grounding, and identifiability conditions. Evaluating state-of-the-art LLMs, we find that models show evidence of rule induction, but that this ability is limited and uneven across tasks. Process execution introduces an additional bottleneck, whereas surface semantics has minimal effect. Induction-specific steering methods improve performance on attribute tasks but yield no reliable gains on procedural tasks, suggesting that the gap in procedural induction remains an open challenge.