While multimodal large language models (MLLMs) are increasingly applied in human-centred AI systems, their ability to understand complex social interactions remains uncertain. We present an exploratory study on aligning MLLMs with speech-language pathologists (SLPs) in analysing joint attention, a key construct in early social-communicative development, in parent-child interactions. Drawing on interviews with three SLPs and their video annotations, we characterise how observational cues of gaze, action, and vocalisation inform their reasoning processes. We then test whether an MLLM can approximate this workflow through a two-stage prompting approach that separates observation from judgement. Our findings reveal that alignment is more robust at the observation layer, where experts share common descriptors, than at the judgement layer, where interpretive criteria diverge. We position this work as a case-based probe into expert-AI alignment on complex social behaviour, highlighting both the feasibility and the challenges of applying MLLMs to socially situated interaction analysis.
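As a rough illustration of the two-stage prompting approach described above, the sketch below separates an observation prompt (gaze, action, vocalisation) from a judgement prompt about joint attention. The client library, model name, frame filename, and all prompt wording are illustrative assumptions, not the study's actual implementation.

```python
# Minimal sketch of two-stage prompting: stage 1 elicits low-level
# observations only; stage 2 asks for a joint-attention judgement grounded
# in those observations. Model choice and prompts are hypothetical.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def encode_frame(path: str) -> str:
    """Base64-encode a sampled video frame for a vision-capable model."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def ask(prompt: str, frame_b64: str) -> str:
    """Send one text+image prompt and return the model's text reply."""
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable MLLM; a hypothetical choice
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


frame = encode_frame("clip_frame_012.jpg")  # hypothetical sampled frame

# Stage 1: observation -- describe cues without interpreting them.
observations = ask(
    "Describe only what is observable in this parent-child interaction: "
    "each person's gaze direction, actions, and vocalisations. "
    "Do not interpret or judge the interaction.",
    frame,
)

# Stage 2: judgement -- reason about joint attention from the observations.
judgement = ask(
    "Given these observations:\n" + observations + "\n"
    "Is there evidence of joint attention between parent and child? "
    "Explain which cues support your answer.",
    frame,
)
print(judgement)
```

Keeping the two stages as separate calls mirrors the workflow the study probes: disagreement can then be localised to either the descriptive layer or the interpretive layer rather than conflated in a single response.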