Current video avatar generation methods excel at identity preservation and motion alignment but lack genuine agency: they cannot autonomously pursue long-term goals through adaptive environmental interaction. We address this by introducing L-IVA (Long-horizon Interactive Visual Avatar), a task and benchmark for evaluating goal-directed planning in stochastic generative environments, and ORCA (Online Reasoning and Cognitive Architecture), the first framework enabling active intelligence in video avatars. ORCA embodies Internal World Model (IWM) capabilities through two key innovations: (1) a closed-loop OTAR cycle (Observe-Think-Act-Reflect) that maintains robust state tracking under generative uncertainty by continuously verifying predicted outcomes against actual generations, and (2) a hierarchical dual-system architecture in which System 2 performs strategic reasoning with state prediction while System 1 translates abstract plans into precise, model-specific action captions. By formulating avatar control as a POMDP and implementing continuous belief updating with outcome verification, ORCA enables autonomous multi-step task completion in open-domain scenarios. Extensive experiments demonstrate that ORCA significantly outperforms open-loop and non-reflective baselines in task success rate and behavioral coherence, validating our IWM-inspired design for advancing video avatar intelligence from passive animation to active, goal-oriented behavior.
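The closed-loop control described above can be sketched in outline. This is a minimal illustration of an Observe-Think-Act-Reflect cycle with POMDP-style belief updating and outcome verification; all class and function names here (`Belief`, `think`, `act`, `generate`, `reflect`, `otar_loop`) are hypothetical stand-ins, not the paper's actual API, and the generator is a trivial deterministic stub.

```python
# Hedged sketch of an OTAR (Observe-Think-Act-Reflect) loop: System 2
# plans and predicts the next state, System 1 compiles the plan into an
# action caption, the (here: stubbed) generator produces an observation,
# and reflection verifies the prediction before the belief is updated.
# All names are illustrative assumptions, not the paper's implementation.
from dataclasses import dataclass, field


@dataclass
class Belief:
    """Agent's belief over the avatar's latent state (simplified to a dict)."""
    state: dict = field(default_factory=dict)

    def update(self, observation: dict) -> None:
        # Fold the new observation into the belief (simplified: overwrite).
        self.state.update(observation)


def think(belief: Belief, goal: str) -> tuple[str, dict]:
    """System 2 (hypothetical): choose an abstract action and predict its outcome."""
    action = f"advance toward {goal}"
    predicted = dict(belief.state, progress=belief.state.get("progress", 0) + 1)
    return action, predicted


def act(action: str) -> str:
    """System 1 (hypothetical): compile the abstract plan into a model-specific caption."""
    return f"caption: {action}"


def generate(caption: str, belief: Belief) -> dict:
    """Stand-in for the stochastic video generator: returns an observation."""
    return dict(belief.state, progress=belief.state.get("progress", 0) + 1)


def reflect(predicted: dict, observed: dict) -> bool:
    """Reflect: verify the predicted outcome against the actual generation."""
    return predicted == observed


def otar_loop(goal: str, steps: int = 3) -> Belief:
    belief = Belief()
    for _ in range(steps):
        action, predicted = think(belief, goal)  # Think: plan + predict state
        caption = act(action)                    # Act: emit action caption
        observed = generate(caption, belief)     # Observe the new generation
        if not reflect(predicted, observed):     # Reflect: outcome verification
            continue                             # mismatch: re-plan, skip update
        belief.update(observed)                  # verified: update the belief
    return belief
```

In a real system the generator is stochastic, so the `reflect` check would fail intermittently; the point of the loop is that belief updates are gated on verification rather than trusting open-loop predictions.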