We present Musketeer, a single vision-language model whose parameters are jointly trained on all tasks and fully shared among multiple heterogeneous tasks, even though those tasks may interfere with one another. Knowledge integration across these heterogeneous tasks is enabled by a novel feature called the Task Explanation Prompt (TEP). By supplying rich, structured information such as each task's input/output format, TEP reduces interference among tasks and allows the model to focus on their shared structure. As a single model, Musketeer achieves results comparable to or better than strong baselines trained on individual tasks, almost uniformly across multiple tasks.
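To make the idea of a structured task prompt more concrete, here is a minimal Python sketch under stated assumptions. The abstract specifies only that TEP carries structured information such as task input/output format; the class `TaskExplanationPrompt`, its fields (`task_name`, `data_description`, `input_format`, `output_format`), and the serialization format below are hypothetical illustrations, not Musketeer's actual TEP template.

```python
from dataclasses import dataclass


@dataclass
class TaskExplanationPrompt:
    # Hypothetical schema: the abstract names only "task input/output format";
    # the remaining fields are illustrative assumptions.
    task_name: str
    data_description: str
    input_format: str
    output_format: str

    def render(self, instance: str) -> str:
        """Serialize the structured task fields, then the instance, into one text prompt."""
        return (
            f"Task: {self.task_name}. "
            f"Data: {self.data_description}. "
            f"Input: {self.input_format}. "
            f"Output: {self.output_format}. "
            f"Instance: {instance}"
        )


# Example: a hypothetical prompt for visual question answering.
vqa = TaskExplanationPrompt(
    task_name="visual question answering",
    data_description="natural images paired with questions",
    input_format="an image and a question",
    output_format="a short free-form answer",
)
print(vqa.render("What color is the bus?"))
```

Because every task's prompt follows the same schema, a fully shared model would receive consistent, explicit cues about which task it is solving and what output is expected; this is one plausible reading of how structured prompts could reduce interference, not a description of Musketeer's internals.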