Effective real-world assistance requires AI agents with robust Theory of Mind (ToM), the ability to infer human mental states from observed behavior. Despite recent advances, several key challenges remain: (1) online inference that robustly updates uncertainty over multiple hypotheses; (2) reasoning efficient enough for real-time assistance; and (3) the lack of ground-truth mental state annotations in real-world domains. We address these challenges with MindZero, a self-supervised reinforcement learning framework that trains multimodal large language models (MLLMs) for efficient and robust online mental reasoning. During training, the model is rewarded for generating mental state hypotheses that maximize the likelihood of the observed actions, as estimated by a planner, in the manner of model-based ToM reasoning. The method therefore requires no explicit mental state annotations. After training, MindZero internalizes model-based reasoning into fast single-pass inference. We evaluate MindZero against baselines on challenging mental reasoning and AI assistance tasks in gridworld and household domains. We find that LLMs alone are insufficient, and that model-based methods improve accuracy but are slow, costly, and limited by the capacity of the backbone MLLM. In contrast, MindZero enhances the intrinsic ToM ability of MLLMs and significantly outperforms model-based methods in both accuracy and efficiency, showing that mental reasoning can be learned effectively as a self-supervised skill.
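To make the training signal concrete, the following is a minimal sketch of a planner-based likelihood reward, not MindZero's actual implementation. It assumes an obstacle-free gridworld, a goal as the only mental state variable, and a Boltzmann-rational planner scored by Manhattan distance. The names `action_log_probs`, `hypothesis_reward`, `posterior`, and the parameter `beta` are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical action set for an obstacle-free gridworld (assumption).
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


def action_log_probs(pos, goal, beta=2.0):
    """Log-probabilities of each action under a Boltzmann-rational planner.

    Actions that leave the agent closer to the goal (Manhattan distance)
    receive higher probability; beta controls how sharply.
    """
    scores = {}
    for name, (dx, dy) in ACTIONS.items():
        nxt = (pos[0] + dx, pos[1] + dy)
        dist = abs(nxt[0] - goal[0]) + abs(nxt[1] - goal[1])
        scores[name] = -beta * dist
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    return {a: s - log_z for a, s in scores.items()}


def hypothesis_reward(trajectory, goal, beta=2.0):
    """Reward for a goal hypothesis: log-likelihood of the observed actions."""
    return sum(action_log_probs(pos, goal, beta)[act] for pos, act in trajectory)


def posterior(trajectory, goals, beta=2.0):
    """Normalized belief over goal hypotheses, assuming a uniform prior."""
    logs = {g: hypothesis_reward(trajectory, g, beta) for g in goals}
    top = max(logs.values())  # subtract the max for numerical stability
    weights = {g: math.exp(v - top) for g, v in logs.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


# Observed (position, action) pairs: the agent keeps moving right.
trajectory = [((0, 0), "right"), ((1, 0), "right"), ((2, 0), "right")]
goals = [(4, 0), (0, 4)]

rewards = {g: hypothesis_reward(trajectory, g) for g in goals}
post = posterior(trajectory, goals)
print(rewards)
print(post)
```

In this toy setting the hypothesis that the goal lies to the right scores higher than the alternative, so a model proposing it would receive the larger reward without any labeled mental state. Normalizing the same likelihoods gives a belief over hypotheses that can be recomputed as each new action arrives, which is the kind of online uncertainty update the abstract refers to.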