In recent years, large-scale multimodal models have demonstrated impressive capabilities across various domains. However, enabling these models to perform multiple multimodal tasks simultaneously and effectively remains a significant challenge. To address this, we introduce neural tuning, a novel tuning method designed to handle diverse multimodal tasks concurrently, including reasoning segmentation, referring segmentation, image captioning, and text-to-image generation. Neural tuning emulates the sparse distributed representation found in the human brain, where only a specific subset of neurons is activated for each task. Additionally, we present a new benchmark, MMUD, in which each sample is annotated with multiple task labels. By applying neural tuning to pretrained large models on the MMUD benchmark, we achieve simultaneous task handling in a streamlined and efficient manner. All models, code, and datasets will be made publicly available after publication to facilitate further research and development in this field.
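The idea that each task activates only a specific subset of neurons can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the subsets are chosen, so the fixed random masks, the function names `make_task_masks` and `sparse_forward`, and the `active_fraction` parameter are all assumptions made purely for illustration.

```python
import random

# Task names taken from the abstract; the identifiers themselves are hypothetical.
TASKS = [
    "reasoning_segmentation",
    "referring_segmentation",
    "image_captioning",
    "text_to_image",
]


def make_task_masks(num_neurons, tasks, active_fraction, seed=0):
    """Assign each task a fixed subset of neuron indices.

    Assumption: subsets are drawn at random. The abstract does not state
    how the per-task subsets are actually selected.
    """
    rng = random.Random(seed)
    k = max(1, int(num_neurons * active_fraction))
    return {task: sorted(rng.sample(range(num_neurons), k)) for task in tasks}


def sparse_forward(inputs, weights, active):
    """Compute a linear layer only for the active neurons.

    weights[j] is the weight vector of neuron j. Neurons outside
    `active` are skipped and their outputs stay at zero.
    """
    outputs = [0.0] * len(weights)
    for j in active:
        outputs[j] = sum(w * x for w, x in zip(weights[j], inputs))
    return outputs


if __name__ == "__main__":
    masks = make_task_masks(num_neurons=16, tasks=TASKS, active_fraction=0.25)
    weights = [[1.0, 2.0] for _ in range(16)]
    for task in TASKS:
        out = sparse_forward([1.0, 1.0], weights, masks[task])
        print(task, "active neurons:", masks[task])
```

In this sketch the cost of a forward pass scales with the number of active neurons rather than the layer width, which is one way a sparse scheme could keep multi-task handling efficient.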