Human-AI Alignment of Multimodal Large Language Models with Speech-Language Pathologists in Parent-Child Interactions

From arXiv. This is an earlier version of the work, released in May 2025. The version accepted at CHI 2026 is available as a separate preprint at arXiv:2511.04366.

Joint attention is a critical marker of early social-communicative development, yet it remains difficult for caregivers to assess without expert guidance. In this work, we explore how multimodal large language models (MLLMs) can be aligned with the reasoning processes of speech-language pathologists (SLPs) to support the interpretation of everyday parent-child interactions. We conducted in-depth interviews and video annotation studies with three experienced SLPs to uncover how they evaluate joint attention based on three core behavioural cues: gaze, action, and vocalisation. Using these insights, we developed a two-stage MLLM-based system that first extracts fine-grained behavioural descriptions from video segments and then judges joint attention quality using expert-aligned prompts. Our evaluation across 26 parent-child interaction videos shows that MLLMs can achieve up to 85% accuracy in perceptual cue extraction and over 75% average precision in simulating expert judgement. We further propose design guidelines for building MLLM-based behaviour observation-judgement systems that align with SLPs, emphasising the structuring of behavioural cues, the construction of exemplar libraries grounded in expert annotations, and the need to personalise system responses based on developmental stage and neurotypical or atypical presentation. This work provides structured behavioural cues derived from SLP expertise, demonstrates the feasibility of aligning MLLMs with SLPs' observation and judgement, and offers practical design guidelines for building aligned systems to support parent-child interaction analysis.
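
The two-stage structure described in the abstract (describe the behavioural cues first, then judge joint attention from those descriptions) might be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the `ModelFn` type, the prompt wording, the "good"/"poor" labels, and the `stub_model` stand-in are all assumptions introduced here, and a real system would pass video frames to an actual multimodal model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical type: any function mapping a text prompt to a text reply.
# A real MLLM call would also receive the video segment itself.
ModelFn = Callable[[str], str]

# The three behavioural cues named in the abstract.
CUES = ("gaze", "action", "vocalisation")


@dataclass
class SegmentDescription:
    segment_id: str
    cues: Dict[str, str]  # cue name -> free-text behavioural description


def describe_segment(segment_id: str, model: ModelFn) -> SegmentDescription:
    """Stage 1: ask the model for one description per behavioural cue."""
    cues = {
        cue: model(f"Describe the {cue} of parent and child in segment {segment_id}.")
        for cue in CUES
    }
    return SegmentDescription(segment_id, cues)


def judge_joint_attention(
    desc: SegmentDescription, exemplars: List[str], model: ModelFn
) -> str:
    """Stage 2: judge joint attention from the stage-1 descriptions.

    `exemplars` stands in for expert-annotated examples; the label set
    ("good" / "poor") is an assumption made for this sketch.
    """
    lines = ["Judge the joint attention quality as 'good' or 'poor'."]
    lines += [f"Expert example: {e}" for e in exemplars]
    lines += [f"{cue}: {text}" for cue, text in desc.cues.items()]
    return model("\n".join(lines)).strip().lower()


def stub_model(prompt: str) -> str:
    """Placeholder model so the sketch runs without any external service."""
    if prompt.startswith("Describe"):
        return "placeholder description"
    return "Good"


description = describe_segment("seg-01", stub_model)
label = judge_joint_attention(description, ["child follows parent's point"], stub_model)
```

Keeping the two stages as separate functions means the cue descriptions can be checked against expert annotations on their own, before any judgement is made from them.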
