End-to-end Listen, Look, Speak and Act

Human interaction is inherently multimodal and full-duplex: we listen while watching, speak while acting, and fluidly adapt to turn-taking and interruptions. Realizing these capabilities is essential for building models that simulate humans. We present ELLSA (End-to-end Listen, Look, Speak and Act), which, to our knowledge, is the first full-duplex, end-to-end model that simultaneously perceives and generates across vision, text, speech, and action within a single architecture. This enables interaction patterns that were previously out of reach and yields more natural, human-like behaviors. At its core is a novel SA-MoE (Self-Attention Mixture-of-Experts) architecture that routes each modality to specialized experts and fuses them through a unified attention backbone. This design provides a generalizable solution for joint multimodal perception and concurrent generation: it leverages strong pre-trained components while enabling efficient modality integration and mitigating modality interference. On speech-interaction and robot-manipulation benchmarks, ELLSA matches modality-specific baselines while uniquely supporting advanced multimodal and full-duplex behaviors such as dialogue and action turn-taking, rejection of defective instructions, speaking while acting, context-grounded visual question answering, and action barge-ins. We contend that ELLSA represents a step toward more natural and general interactive intelligence, contributing to the broader pursuit of artificial general intelligence. All data, code, and model checkpoints will be released at https://github.com/bytedance/SALMONN/tree/ELLSA.
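To make the SA-MoE idea concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes that "routing" means each token is projected by the expert assigned to its modality, and that "fusing" means a single self-attention pass over the tokens of all modalities. The causal mask, the expert layout, and all names (`sa_moe_layer`, `make_expert`, `Wq`, `Wk`, `Wv`) are hypothetical choices made for illustration; the abstract does not specify them.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_identity(dim, scale):
    """Deterministic stand-in for learned weights (assumption for the demo)."""
    return [[scale if r == c else 0.0 for c in range(dim)] for r in range(dim)]


def make_expert(dim, scale):
    """A hypothetical modality expert: its own query/key/value projections."""
    return {name: scaled_identity(dim, scale) for name in ("Wq", "Wk", "Wv")}


def sa_moe_layer(tokens, experts):
    """tokens: list of (modality, vector); experts: modality -> projections.

    Step 1 (routing): each token is projected by its own modality's expert.
    Step 2 (fusion): one causal self-attention pass over all tokens, so a
    token can attend to earlier tokens from any modality.
    """
    q, k, v = [], [], []
    for modality, x in tokens:
        expert = experts[modality]
        q.append(matvec(expert["Wq"], x))
        k.append(matvec(expert["Wk"], x))
        v.append(matvec(expert["Wv"], x))

    dim = len(q[0])
    outputs = []
    for i, qi in enumerate(q):
        # Causal mask (assumed): position i sees positions 0..i only.
        scores = [
            sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j][d] for j, w in enumerate(weights)) for d in range(dim)]
        )
    return outputs


experts = {"speech": make_expert(2, 1.0), "action": make_expert(2, 2.0)}
tokens = [("speech", [1.0, 0.0]), ("action", [0.0, 1.0])]
fused = sa_moe_layer(tokens, experts)
```

In this toy run the action token's output mixes in a small contribution from the earlier speech token, which is the cross-modal fusion the shared attention is meant to provide, while each modality keeps its own projection weights.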
