Large language models have evolved into data-efficient generalists, benefiting from a universal language interface and large-scale pre-training. However, constructing a data-efficient generalist for dense visual prediction presents a distinct challenge because label structures vary across tasks. Consequently, generalization to unseen dense prediction tasks in the low-data regime is not straightforward and has received less attention from previous vision generalists. In this study, we explore a universal model that can flexibly adapt to unseen dense label structures with a few examples, enabling it to serve as a data-efficient vision generalist in diverse real-world scenarios. To this end, we base our method on a powerful meta-learning framework and explore several axes to improve its performance and versatility for real-world problems, such as flexible adaptation mechanisms and scalability. We evaluate our model across a spectrum of unseen real-world scenarios where low-shot learning is desirable, including video, 3D, medical, biological, and user-interactive tasks. Equipped with a generic architecture and an effective adaptation mechanism, our model flexibly adapts to all of these tasks with at most 50 labeled images, a significant advance over existing data-efficient generalist approaches. Code is available at https://github.com/GitGyun/chameleon.