Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy-centric tasks, leaving a critical gap in assessing social interactivity: the fundamental capacity to navigate dynamic cues in natural dialogue. To close this gap, we propose SocialOmni, a comprehensive benchmark that operationalizes the evaluation of conversational interactivity along three core dimensions: (i) speaker separation and identification (who is speaking), (ii) interruption timing control (when to interject), and (iii) natural interruption generation (how to phrase the interruption). SocialOmni comprises 2,000 perception samples and a quality-controlled diagnostic set of 209 interaction-generation instances with strict temporal and contextual constraints, complemented by controlled audio-visual inconsistency scenarios that probe model robustness. We benchmark 12 leading OLMs and uncover significant variance in their social-interaction capabilities. Furthermore, our analysis reveals a pronounced decoupling between a model's perceptual accuracy and its ability to generate contextually appropriate interruptions, indicating that understanding-centric metrics alone are insufficient to characterize conversational social competence. Encouragingly, the diagnostics provided by SocialOmni yield actionable signals for bridging this perception-interaction divide in future OLMs.