Despite recent progress in Multi-Modal Large Language Models (MLLMs), it remains challenging to integrate diverse tasks ranging from pixel-level perception to high-fidelity generation. Existing approaches often suffer from either restricted task extensibility or severe performance degradation due to modality interference. In this paper, we present LLMBind, an extensible framework that unifies multimodal tasks through a dual-pathway mechanism: In-Situ semantic embeddings for localization-sensitive tasks such as semantic segmentation, and Ex-Situ task prompts for generation across image, video, and audio modalities. Additionally, we employ a Mixture-of-Experts (MoE) architecture to route task-specific tokens, thereby achieving modality disentanglement and mitigating negative transfer. We also curate a 400k multi-turn interaction dataset focused on iterative visual refinement to enable human-like interaction. Extensive experiments demonstrate that LLMBind achieves excellent performance across multiple perception and generation benchmarks while maintaining superior extensibility.
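To make the token-routing idea concrete, the following is a minimal, illustrative sketch of a standard top-k Mixture-of-Experts layer in PyTorch: a learned router scores each token and dispatches it to a small subset of expert FFNs, which is the general mechanism by which task-specific tokens can be kept on separate parameter paths. The class name `TopKMoE`, the expert count, and all hyperparameters are hypothetical and not taken from LLMBind's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k MoE layer: each token is scored by a router and its
    output is the softmax-weighted sum of its top-k experts. Routing
    task-specific tokens (e.g. segmentation vs. generation prompts) to
    different experts is one way to reduce cross-task interference.
    This is a generic sketch, not the paper's exact architecture."""

    def __init__(self, dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)  # token -> expert logits
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        logits = self.router(x)                         # (B, T, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)  # top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                 # tokens routed to expert e at rank k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

if __name__ == "__main__":
    layer = TopKMoE(dim=64)
    tokens = torch.randn(2, 16, 64)   # e.g. a mix of perception and generation tokens
    print(layer(tokens).shape)        # torch.Size([2, 16, 64])
```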