Human interaction is inherently multimodal and full-duplex: we listen while watching, speak while acting, and fluidly adapt to turn-taking and interruptions. Realizing these capabilities is essential for building models that simulate humans. We present ELLSA (End-to-end Listen, Look, Speak and Act), which, to our knowledge, is the first full-duplex, end-to-end model that simultaneously perceives and generates across vision, text, speech, and action within a single architecture. This enables interaction patterns that were previously out of reach and yields more natural, human-like behaviors. At its core is a novel SA-MoE (Self-Attention Mixture-of-Experts) architecture that routes each modality to specialized experts and fuses them through a unified attention backbone. This design provides a generalizable solution for joint multimodal perception and concurrent generation: it leverages strong pre-trained components while enabling efficient modality integration and mitigating modality interference. On speech-interaction and robot-manipulation benchmarks, ELLSA matches modality-specific baselines while uniquely supporting advanced multimodal and full-duplex behaviors such as dialogue and action turn-taking, defective instruction rejection, speaking-while-acting, context-grounded visual question answering, and action barge-ins. We contend that ELLSA represents a step toward more natural and general interactive intelligence, contributing to the broader pursuit of artificial general intelligence. All data, code, and model checkpoints will be released at https://github.com/bytedance/SALMONN/tree/ELLSA.
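To make the SA-MoE idea concrete, the following is a minimal sketch of one way "route each modality to specialized experts and fuse them through a unified attention backbone" could be realized. It is not the ELLSA implementation: the abstract does not specify the layer design, so the class `ModalityExpert`, the function `sa_moe_layer`, the hard routing by modality tag, the per-expert query/key/value/output projections, the causal mask, and the toy dimensions are all assumptions made for illustration.

```python
import math
import random


def matvec(weight, vec):
    """Multiply a (rows x cols) matrix, stored as nested lists, by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ModalityExpert:
    """Hypothetical expert with its own query, key, value and output projections."""

    def __init__(self, rng, dim):
        def rand():
            return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
        self.w_q, self.w_k, self.w_v, self.w_o = rand(), rand(), rand(), rand()


def sa_moe_layer(tokens, modalities, experts):
    """One illustrative layer: per-modality projections, shared causal attention."""
    dim = len(tokens[0])
    # Step 1: hard routing. Each token is projected by the expert named in its tag.
    q = [matvec(experts[m].w_q, x) for x, m in zip(tokens, modalities)]
    k = [matvec(experts[m].w_k, x) for x, m in zip(tokens, modalities)]
    v = [matvec(experts[m].w_v, x) for x, m in zip(tokens, modalities)]

    outputs = []
    for i, modality in enumerate(modalities):
        # Step 2: unified attention. Token i attends to itself and every earlier
        # token, whatever modality produced them (the causal mask is an assumption).
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(dim)
                  for j in range(i + 1)]
        weights = softmax(scores)
        context = [sum(w * v[j][d] for j, w in enumerate(weights))
                   for d in range(dim)]
        # Step 3: the fused context goes back through the token's own expert.
        outputs.append(matvec(experts[modality].w_o, context))
    return outputs


# Toy interleaved sequence: two speech tokens followed by two action tokens.
rng = random.Random(0)
DIM = 4
experts = {name: ModalityExpert(rng, DIM) for name in ("speech", "action")}
modalities = ["speech", "speech", "action", "action"]
tokens = [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in modalities]
fused = sa_moe_layer(tokens, modalities, experts)
print(len(fused), len(fused[0]))
```

In this sketch the experts never share weights, which is one plausible reading of how modality interference could be limited, while the single attention step is the only place where speech and action tokens exchange information. A real system would presumably use multi-head attention, feed-forward sublayers, and pre-trained expert weights, none of which are modeled here.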