While multimodal large language models (MLLMs) are increasingly applied in human-centred AI systems, their ability to understand complex social interactions remains uncertain. We present an exploratory study on aligning MLLMs with speech-language pathologists (SLPs) in analysing joint attention in parent-child interactions, a key construct in early social-communicative development. Drawing on interviews and video annotations with three SLPs, we characterise how observational cues of gaze, action, and vocalisation inform their reasoning processes. We then test whether an MLLM can approximate this workflow through a two-stage prompting approach, separating observation from judgement. Our findings reveal that alignment is more robust at the observation layer, where experts share common descriptors, than at the judgement layer, where interpretive criteria diverge. We position this work as a case-based probe into expert-AI alignment in complex social behaviour, highlighting both the feasibility and the challenges of applying MLLMs to socially situated interaction analysis.