We investigate turn-taking in multimodal multi-party conversations using large language models (LLMs). We construct an evaluation framework for three tasks: addressee detection, turn-change prediction, and next speaker prediction. We compare supervised models trained for these tasks, text-based LLMs, multimodal LLMs (MM-LLMs), and human subjects. Experiments on the AMI corpus showed that LLMs outperformed supervised models and humans in next speaker prediction, despite not being trained on the target domain and without access to audio or visual information. An MM-LLM performed better than text-based LLMs on addressee detection and turn-change prediction but remained below human performance, indicating difficulty leveraging raw audio-visual signals. Ablation analyses revealed that conversational context was critical, particularly for next speaker prediction. We observed that human and LLM prediction patterns were similar, and intervals with frequent turn changes were difficult for both.