Head pose estimation (HPE) requires a sophisticated understanding of 3D spatial relationships to generate precise yaw, pitch, and roll angles. Previous HPE models, primarily CNN-based, rely on cropped close-up images of human heads as input and often lack robustness in real-world scenarios. Vision Language Models (VLMs) can analyze entire images while focusing on specific objects through their attention mechanisms. In this paper, we propose a novel framework that improves HPE accuracy by leveraging the object detection grounding capability of a VLM, namely CogVLM. We find empirically that directly fine-tuning this VLM with LoRA for the HPE task fails to achieve the desired accuracy, while some model merging methods improve accuracy but frequently produce invalid responses that blend the output formats of the two tasks, and so struggle to handle object detection and HPE simultaneously. To integrate HPE capability into CogVLM effectively, we develop a novel LoRA layer-based model merging method. This approach applies a high cosine similarity threshold and a winner-takes-all layer selection strategy, aligning attention to the HPE task while preserving the original object detection knowledge. It resolves the blended invalid response formats and improves accuracy. Results show that our HPE-CogVLM achieves a 31.5% reduction in Mean Absolute Error over the current state-of-the-art CNN model, 6DRepNet, in cross-dataset evaluation. Furthermore, HPE-CogVLM outperforms both the directly LoRA fine-tuned and the task arithmetic-based merged VLMs across all HPE metrics.
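To make the idea of layer-wise, winner-takes-all merging concrete, the following is a minimal sketch and not the method as implemented in this work. It assumes that each model is represented as a dictionary mapping layer names to weight arrays, that similarity is measured between the two models' weights for the same layer, and that a layer whose similarity reaches the threshold is taken from the HPE model while any other layer is kept from the grounding model. The function names, the direction of the selection rule, and the threshold value of 0.95 are all hypothetical choices made for illustration.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two weight arrays, compared as flat vectors."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def merge_layers(grounding, hpe, threshold=0.95):
    """Merge two models layer by layer without blending weights.

    Each layer is copied whole from exactly one model (winner-takes-all).
    ASSUMPTION (illustrative only): a layer is taken from the HPE model
    when its cosine similarity to the grounding layer is >= threshold,
    and kept from the grounding model otherwise.
    """
    merged, picked = {}, {}
    for name, w_ground in grounding.items():
        w_hpe = hpe[name]
        sim = cosine_similarity(w_ground, w_hpe)
        source = "hpe" if sim >= threshold else "grounding"
        merged[name] = np.array(w_hpe if source == "hpe" else w_ground, dtype=float)
        picked[name] = source
    return merged, picked


# Toy usage with two hypothetical layers.
grounding = {"layer_a": np.array([1.0, 0.0]), "layer_b": np.array([1.0, 0.0])}
hpe = {"layer_a": np.array([1.0, 0.01]), "layer_b": np.array([0.0, 1.0])}
merged, picked = merge_layers(grounding, hpe)
print(picked)
```

Because every layer comes unmodified from a single model, no layer is an average of two sets of weights, which is the property that distinguishes this kind of selection from task arithmetic-style blending.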