The objective assessment of human affective and psychological states presents a significant challenge, particularly through non-verbal channels. This paper introduces digital drawing as a rich and underexplored modality for affective sensing. We present a novel multimodal framework, named ArtCognition, for the automated analysis of the House-Tree-Person (HTP) test, a widely used psychological instrument. ArtCognition uniquely fuses two distinct data streams: static visual features of the final artwork, captured by computer vision models, and dynamic kinematic cues of drawing behavior derived from the drawing process itself, such as stroke speed, pauses, and smoothness. To bridge the gap between low-level features and high-level psychological interpretation, we employ a Retrieval-Augmented Generation (RAG) architecture, which grounds the analysis in established psychological knowledge, enhancing explainability and reducing the potential for model hallucination. Our results demonstrate that fusing visual and kinematic cues yields a more nuanced assessment than either modality alone. We show significant correlations between the extracted multimodal features and standardized psychological metrics, validating the framework's potential as a scalable tool to support clinicians. This work contributes a new methodology for non-intrusive affective state assessment and opens new avenues for technology-assisted mental healthcare.
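As a concrete illustration of the behavioral modality, the kinematic cues named above (stroke speed, pauses, smoothness) can be derived from timestamped pen samples. The sketch below is a minimal, hypothetical example, not the paper's actual feature extractor; the function name, pause threshold, and the use of speed variance as a crude smoothness proxy are all illustrative assumptions.

```python
import math

def stroke_kinematics(points, pause_threshold=0.2):
    """Compute simple kinematic cues from one pen stroke.

    `points` is a list of (x, y, t) samples along the stroke.
    The threshold and the smoothness proxy are illustrative choices,
    not the features defined by ArtCognition.
    """
    speeds = []
    pauses = 0
    for (x0, y0, t0), (x1, y1, t1) in zip(points, points[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip out-of-order or duplicate timestamps
        speeds.append(math.hypot(x1 - x0, y1 - y0) / dt)
        if dt > pause_threshold:
            pauses += 1  # long gap between samples counts as a pause
    mean_speed = sum(speeds) / len(speeds) if speeds else 0.0
    # Crude smoothness proxy: lower speed variance -> smoother stroke.
    variance = (
        sum((s - mean_speed) ** 2 for s in speeds) / len(speeds)
        if speeds else 0.0
    )
    return {"mean_speed": mean_speed, "pauses": pauses, "speed_variance": variance}
```

In a full pipeline, per-stroke summaries like these would be aggregated over the whole drawing session and fused with the visual features before interpretation.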