We propose FocusCLIP, which integrates subject-level guidance, a specialized mechanism for target-specific supervision, into the CLIP framework to improve zero-shot transfer on human-centric tasks. Our contributions enhance CLIP on both the vision and text sides. On the vision side, we incorporate ROI heatmaps that emulate human visual attention to emphasize subject-relevant image regions. On the text side, we introduce human pose descriptions to provide rich contextual information. For human-centric tasks, FocusCLIP is trained with images from the MPII Human Pose dataset. The proposed approach surpasses CLIP by an average of 8.61 percentage points across five previously unseen datasets covering three human-centric tasks, achieving an average accuracy of 33.65% compared to 25.04% for CLIP. We observe improvements of 3.98% in activity recognition, 14.78% in age classification, and 7.06% in emotion recognition. Using our proposed single-shot LLM prompting strategy, we also release a high-quality MPII Pose Descriptions dataset to encourage further research in multimodal learning for human-centric tasks. Finally, we demonstrate the effectiveness of our subject-level supervision on non-human-centric tasks: FocusCLIP shows a 2.47% improvement over CLIP in zero-shot bird classification on the CUB dataset. Our findings emphasize the potential of integrating subject-level guidance with general pretraining methods for enhanced downstream performance.
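The headline numbers in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only figures quoted above; the variable names are illustrative and not taken from the paper. It shows that the 8.61 gain is the absolute difference between the two average accuracies. It also happens to match the unweighted mean of the three per-task gains, which suggests, but does not confirm, that the average is taken per task. The relative gain is computed only for context and is not a figure the abstract reports.

```python
# Average zero-shot accuracies quoted in the abstract (in %).
focusclip_avg = 33.65
clip_avg = 25.04

# Per-task improvements quoted in the abstract.
# Assumption: these are absolute gains, like the headline figure.
task_gains = {
    "activity recognition": 3.98,
    "age classification": 14.78,
    "emotion recognition": 7.06,
}

# Absolute gain: difference between the two average accuracies.
absolute_gain = focusclip_avg - clip_avg

# Unweighted mean of the three per-task gains.
mean_task_gain = sum(task_gains.values()) / len(task_gains)

# Relative gain over the CLIP baseline (not reported in the abstract).
relative_gain = 100 * absolute_gain / clip_avg

print(f"absolute gain:      {absolute_gain:.2f} points")
print(f"mean per-task gain: {mean_task_gain:.2f} points")
print(f"relative gain:      {relative_gain:.1f}% over CLIP")
```

Both the absolute difference and the per-task mean round to 8.61, so the reported average is consistent with the other figures in the abstract.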