HPE-CogVLM: Advancing Vision Language Models with a Head Pose Grounding Task

Head pose estimation (HPE) requires a sophisticated understanding of 3D spatial relationships to generate precise yaw, pitch, and roll angles. Previous HPE models, primarily CNN-based, rely on cropped close-up images of human heads as input and often lack robustness in real-world scenarios. Vision Language Models (VLMs) can analyze entire images while focusing on specific objects through their attention mechanisms. In this paper, we propose a novel framework that improves HPE accuracy by leveraging the object detection grounding capability of a VLM, namely CogVLM. We empirically find that direct LoRA fine-tuning of this VLM for the HPE task fails to achieve desirable HPE accuracy, while some model merging methods improve accuracy but frequently produce blended, invalid response formats because they struggle to handle the object detection and HPE tasks simultaneously. To integrate HPE capability into CogVLM effectively, we develop a novel LoRA layer-based model merging method. This merging approach applies a high cosine similarity threshold and a winner-takes-all layer selection strategy, aligning attention to the HPE task while preserving the original object detection knowledge. It resolves the problem of blended, invalid response formats and improves accuracy. Results show that our HPE-CogVLM achieves a 31.5% reduction in Mean Absolute Error over the current state-of-the-art CNN model, 6DRepNet, in cross-dataset evaluation. Furthermore, HPE-CogVLM outperforms both the directly LoRA fine-tuned VLM and the task arithmetic-based merged VLM across all HPE metrics.
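The layer-wise merging idea can be illustrated with a minimal sketch. The abstract does not give the threshold value, the direction of the selection rule, or how layers are represented, so everything below is an assumption made for illustration: layers are plain NumPy arrays keyed by name, the threshold is 0.95, and a layer whose two versions are highly similar is kept from the grounding model while any other layer is taken from the HPE model. The names `cosine_similarity` and `merge_winner_takes_all` are hypothetical. The point shown is only that each merged layer is copied whole from one source, never averaged.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two weight arrays, flattened to vectors."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)


def merge_winner_takes_all(grounding_layers, hpe_layers, threshold=0.95):
    """Pick each layer entirely from one model (no weight blending).

    Assumed rule (not taken from the paper): if the two versions of a
    layer have cosine similarity >= threshold, keep the grounding
    model's layer; otherwise take the HPE model's layer.
    """
    merged, choices = {}, {}
    for name, g in grounding_layers.items():
        h = hpe_layers[name]
        if cosine_similarity(g, h) >= threshold:
            merged[name], choices[name] = np.array(g, dtype=float), "grounding"
        else:
            merged[name], choices[name] = np.array(h, dtype=float), "hpe"
    return merged, choices


# Toy weights: layer "a" barely changed, layer "b" changed completely.
grounding = {"a": np.array([1.0, 0.0]), "b": np.array([1.0, 0.0])}
hpe = {"a": np.array([1.0, 0.01]), "b": np.array([0.0, 1.0])}
merged, choices = merge_winner_takes_all(grounding, hpe)
print(choices)
```

Because every merged layer is an exact copy of one source layer, the result contains no interpolated weights, which is the property the abstract associates with avoiding blended response formats.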

