Embodied decision-making is fundamental for AI agents operating in real-world environments. While Visual Language Models (VLMs) have advanced this capability, they still struggle with complex decisions, particularly in human-centered situations that require deep reasoning about human needs and values. In this study, we systematically evaluate open-source VLMs on multimodal human-centered decision-making tasks. We find that LLMs receiving only textual descriptions unexpectedly outperform their VLM counterparts of similar scale that process actual images, suggesting that visual alignment may hinder VLMs' abilities. To address this challenge, we propose a novel text-only training approach using synthesized textual data. This method strengthens the VLM's language component and transfers the learned abilities to multimodal inference, eliminating the need for expensive paired image-text data. Furthermore, we show that VLMs can achieve substantial performance gains through self-improvement, using training data generated by their LLM counterparts rather than relying on larger teacher models such as GPT-4. Our findings establish a more efficient and scalable approach to enhancing VLMs' human-centered decision-making capabilities, opening new avenues for optimizing VLMs through self-improvement mechanisms.
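Below is a minimal sketch of the text-only training idea described above: fine-tune only the VLM's language component on synthesized textual decision-making data while the visual modules stay frozen. The checkpoint (`llava-hf/llava-1.5-7b-hf`), the parameter prefixes (`vision_tower`, `multi_modal_projector`), and the example datum are illustrative assumptions for a LLaVA-style architecture, not the paper's exact setup.

```python
# Sketch: text-only fine-tuning of a VLM's language backbone.
# Assumes a LLaVA-style model from Hugging Face transformers whose visual
# parameters live under the "vision_tower" and "multi_modal_projector"
# prefixes; these names are illustrative, not the paper's configuration.
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # hypothetical choice of base VLM
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained(model_id)

# Freeze everything visual; only the language model remains trainable.
for name, param in model.named_parameters():
    if name.startswith(("vision_tower", "multi_modal_projector")):
        param.requires_grad = False

# One synthesized text-only training example: a scene described in words
# plus a human-centered decision question -- no image input at all.
prompt = (
    "Scene: An elderly man drops his groceries at a busy crosswalk.\n"
    "Question: As an assistive robot, what should you do first?\n"
    "Answer:"
)
target = " Ensure the man is safe from traffic, then help gather his items."

inputs = processor(text=prompt + target, return_tensors="pt")
# For brevity the loss is computed over the full sequence; a real run
# would mask the prompt tokens so only the answer is supervised.
labels = inputs.input_ids.clone()

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)
loss = model(**inputs, labels=labels).loss  # standard causal-LM loss
loss.backward()
optimizer.step()
```

Because the vision tower and projector are left untouched, the updated language backbone can still be paired with image inputs at inference time, which is what allows abilities learned from text-only data to transfer to multimodal decision-making.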