While multimodal large language models (MLLMs) are increasingly applied in human-centred AI systems, their ability to understand complex social interactions remains uncertain. We present an exploratory study on aligning MLLMs with speech-language pathologists (SLPs) in analysing joint attention in parent-child interactions, a key construct in early social-communicative development. Drawing on interviews and video annotations with three SLPs, we characterise how observational cues of gaze, action, and vocalisation inform their reasoning processes. We then test whether an MLLM can approximate this workflow through a two-stage prompting approach, separating observation from judgement. Our findings reveal that alignment is more robust at the observation layer, where experts share common descriptors, than at the judgement layer, where interpretive criteria diverge. We position this work as a case-based probe into expert-AI alignment in complex social behaviour, highlighting both the feasibility and the challenges of applying MLLMs to socially situated interaction analysis.