Enabling models to carry out tasks specified through natural language instructions is a promising path toward more capable and general artificial intelligence. In this work, we introduce InstructSeq, an instruction-conditioned multimodal modeling framework that unifies diverse vision tasks through flexible natural language control and handles both visual and textual data. InstructSeq employs a multimodal transformer architecture that combines visual, language, and sequential modeling. A visual encoder extracts image features and a text encoder encodes the instructions; an autoregressive transformer then fuses the two representations and generates the task output as a sequence. By training with LLM-generated natural language instructions, InstructSeq acquires a strong comprehension of free-form instructions for specifying visual tasks, which provides an intuitive interface for directing its capabilities. Without any task-specific tuning, InstructSeq achieves compelling performance on semantic segmentation, referring expression segmentation/comprehension, and image captioning. The flexible control and multi-task unification give the model more human-like versatility and generalizability for computer vision. The code will be released soon at https://github.com/rongyaofang/InstructSeq.
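The data flow described above (encode the image, encode the instruction, fuse, then decode autoregressively) can be illustrated with a minimal sketch. It is not the InstructSeq implementation: the encoders are omitted and feature lists are passed in directly, fusion by concatenation is an assumption, and `EOS`, `fuse`, `generate`, and `toy_step` are hypothetical names introduced only for this illustration. The toy decoder step stands in for a trained transformer.

```python
from typing import Callable, List

EOS = 0  # hypothetical end-of-sequence token id

Features = List[float]
# A decoder step maps (fused context, tokens generated so far) -> next token id.
StepFn = Callable[[Features, List[int]], int]


def fuse(image_feats: Features, text_feats: Features) -> Features:
    """Fuse the two modalities by concatenation (an assumption, not the paper's method)."""
    return list(image_feats) + list(text_feats)


def generate(image_feats: Features, text_feats: Features,
             step: StepFn, max_len: int = 16) -> List[int]:
    """Greedy autoregressive decoding.

    Each new token is conditioned on the fused context and on every token
    produced so far. Decoding stops at EOS or after max_len tokens.
    """
    context = fuse(image_feats, text_feats)
    tokens: List[int] = []
    for _ in range(max_len):
        nxt = step(context, tokens)
        if nxt == EOS:
            break
        tokens.append(nxt)
    return tokens


def toy_step(context: Features, tokens: List[int]) -> int:
    """Toy stand-in for a trained decoder: counts down from the context length."""
    remaining = len(context) - len(tokens)
    return remaining if remaining > 0 else EOS


if __name__ == "__main__":
    # Two image features plus one instruction feature give a context of length 3.
    print(generate([0.5, 0.25], [1.0], toy_step))
```

In a real system, `step` would be a forward pass of the autoregressive transformer followed by token selection, and the returned token sequence would be decoded into the task-specific output.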