From a Social Cognitive Perspective: Context-aware Visual Social Relationship Recognition

People's social relationships are often manifested through their surroundings, with certain objects or interactions acting as symbols of specific relationships, e.g., wedding rings, roses, hugs, or holding hands. This poses a unique challenge for social relationship recognition: a model must understand and capture the essence of these contexts from visual appearances. However, current methods for social relationship understanding rely on a basic classification paradigm over detected persons and objects, which fails to capture the comprehensive context and often overlooks decisive social factors, especially subtle visual cues. To highlight the social-aware context and intricate details, we propose a novel approach that recognizes \textbf{Con}textual \textbf{So}cial \textbf{R}elationships (\textbf{ConSoR}) from a social cognitive perspective. Specifically, to incorporate social-aware semantics, we build a lightweight adapter on top of a frozen CLIP model to learn social concepts via our novel multi-modal side adapter tuning mechanism. Further, we construct social-aware descriptive language prompts (e.g., scene, activity, objects, emotions) with social relationships for each image, and then compel ConSoR to concentrate on the decisive visual social factors via visual-linguistic contrasting. ConSoR outperforms previous methods with a 12.2\% gain on the People-in-Social-Context (PISC) dataset and a 9.8\% gain on the People-in-Photo-Album (PIPA) benchmark. Furthermore, we observe that ConSoR excels at finding critical visual evidence to reveal social relationships.
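
As a rough illustration of the prompt construction and visual-linguistic contrasting described above, the following minimal sketch builds a descriptive prompt and computes a symmetric contrastive loss over paired image and text embeddings. It is not the ConSoR implementation: the prompt template, the function names (`build_prompt`, `contrastive_loss`), and the temperature value are assumptions made for illustration, and the embeddings are plain Python lists standing in for the outputs of the frozen CLIP encoders and the side adapter, neither of which is modeled here.

```python
import math


def build_prompt(scene, activity, objects, emotions, relationship):
    """Compose one descriptive prompt (hypothetical template, not from the paper)."""
    return (
        f"A photo taken at {scene}, where people are {activity}, "
        f"with {', '.join(objects)}; they look {emotions}. "
        f"Their relationship is {relationship}."
    )


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against the index of the true match."""
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; image i is assumed to be paired with prompt i.

    The temperature of 0.07 is an assumed value, not one reported in the source.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j]: scaled cosine similarity between image i and prompt j
    sim = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
           for img in imgs]
    image_to_text = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    text_to_image = sum(
        cross_entropy_row([sim[r][j] for r in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)


if __name__ == "__main__":
    print(build_prompt("a park", "holding hands", ["roses", "a bench"],
                       "happy", "couple"))
    # Toy 2-D embeddings: correctly paired vs. swapped image/prompt pairs.
    paired = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
    swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
    print(paired < swapped)
```

Minimizing such a loss pulls each image embedding toward its own descriptive prompt and away from the prompts of other images, which is one common way to encourage a model to attend to the visual cues the prompt mentions.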
