LLM-powered multimodal systems are increasingly used to interpret human social behavior, yet how researchers apply these models' 'social competence' remains poorly understood. This paper presents a systematic literature review of 176 publications across application domains including healthcare, education, and entertainment. Using a four-dimensional coding framework (application, technical, evaluative, and ethical), we find (1) frequent use of pattern recognition and information extraction from multimodal sources, but limited support for adaptive, interactive reasoning; (2) a dominant 'modality-to-text' pipeline that privileges language over rich audiovisual signals, stripping away nuanced social cues; (3) evaluation practices reliant on static benchmarks, with socially grounded, human-centered assessments rare; and (4) ethical discussions focused mainly on legal and rights-related risks (e.g., privacy), leaving societal risks (e.g., deception) overlooked, or at best acknowledged but unaddressed. We outline a research agenda for evaluating socially competent, ethically informed, and interaction-aware multimodal systems.