While multimodal large language models (LLMs) excel at dialogue, whether they can adequately parse the structure of conversation -- conversational roles and threading -- remains underexplored. In this work, we introduce a suite of tasks and release TV-MMPC, a new annotated dataset, for multimodal conversation structure understanding. Our evaluation reveals that while all multimodal LLMs outperform our heuristic baseline, even the best-performing model we consider suffers a substantial drop in performance when the identities of the characters in a conversation are anonymized. Beyond evaluation, we carry out a sociolinguistic analysis of 350,842 utterances in TVQA. We find that while female characters initiate conversations at rates proportional to their speaking time, they are 1.2 times more likely than male characters to be cast as an addressee or side-participant, and the presence of side-participants shifts the conversational register from personal to social.