Multimodal Large Language Models (MLLMs) have demonstrated exceptional performance across various objective multimodal perception tasks, yet their application to subjective, emotionally nuanced domains such as psychological analysis remains largely unexplored. In this paper, we introduce PICK, a multi-step framework for Psychoanalytical Image Comprehension through hierarchical analysis and Knowledge injection with MLLMs. We focus on the House-Tree-Person (HTP) Test, a psychological assessment widely used in clinical practice. First, we decompose drawings containing multiple instances into semantically meaningful sub-drawings, constructing a hierarchical representation that captures spatial structure and content at three levels: single-object level, multi-object level, and whole level. Next, we analyze the sub-drawings at each level with a targeted focus, extracting psychological or emotional insights from their visual cues. We also introduce an HTP knowledge base and design a feature extraction module, trained with reinforcement learning, that generates a psychological profile for single-object level analysis. This profile captures both holistic stylistic features and dynamic object-specific features (such as those of the house, tree, or person) and correlates them with psychological states. Finally, we integrate this multi-faceted information to produce a well-informed assessment that aligns with expert-level reasoning. Our approach bridges the gap between MLLMs and specialized expert domains, offering a structured and interpretable framework for understanding human mental states through visual expression. Experimental results demonstrate that PICK significantly enhances the capability of MLLMs in psychological analysis. Extensions to emotion understanding tasks further validate it as a general framework.
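The three-level decomposition described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that some upstream detector has already produced labeled bounding boxes, and it forms multi-object regions as pairwise unions of those boxes, which is only one plausible grouping. The names `Box`, `union_region`, and `build_hierarchy` are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """A labeled bounding box (assumed to come from an upstream detector)."""
    label: str
    x0: int
    y0: int
    x1: int
    y1: int

    def region(self):
        return (self.x0, self.y0, self.x1, self.y1)


def union_region(a, b):
    """Smallest axis-aligned region that covers both boxes."""
    return (min(a.x0, b.x0), min(a.y0, b.y0), max(a.x1, b.x1), max(a.y1, b.y1))


def build_hierarchy(boxes, canvas_size):
    """Group crop regions into single-object, multi-object, and whole levels.

    Pairwise unions for the multi-object level are an illustrative assumption.
    """
    width, height = canvas_size
    single = {b.label: b.region() for b in boxes}
    multi = {
        f"{a.label}+{b.label}": union_region(a, b)
        for a, b in combinations(boxes, 2)
    }
    whole = (0, 0, width, height)
    return {"single": single, "multi": multi, "whole": whole}


# Toy drawing with one house, one tree, and one person (made-up coordinates).
boxes = [
    Box("house", 10, 40, 60, 90),
    Box("tree", 70, 20, 100, 90),
    Box("person", 110, 50, 130, 90),
]
hierarchy = build_hierarchy(boxes, canvas_size=(140, 100))
```

Each region in `hierarchy` could then be cropped and passed to an MLLM with a level-specific prompt; how PICK actually selects and prompts sub-drawings is not specified in this abstract.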