Despite their recent competitive performance across a range of vision tasks, vision Transformers still suffer from heavy computational costs. Recently, vision prompt learning has offered an economical solution to this problem that avoids fine-tuning the whole large-scale model. However, the efficiency of existing models is still far from satisfactory because they insert extensive prompt blocks and rely on tricky prompt designs. In this paper, we propose an efficient vision model named impLicit vIsion prOmpt tuNing (LION), which is motivated by deep implicit models that keep memory costs stable across various complex tasks. In particular, we merely insert two equilibrium implicit layers at the two ends of the pre-trained main backbone, with the backbone parameters frozen. Moreover, we prune the parameters in these two layers according to the lottery ticket hypothesis. LION achieves promising performance on a wide range of datasets. In particular, LION reduces the number of training parameters by up to 11.5% while obtaining higher performance than the state-of-the-art baseline VPT, especially in challenging scenes. Furthermore, we find that LION generalizes well, making it an easy way to boost transfer learning in the future.
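To make the described layout concrete, the following is a minimal sketch under several assumptions, not the implementation used in the paper. It assumes each equilibrium implicit layer solves z = tanh(Wz + Ux) by plain fixed-point iteration, uses a toy affine function as a stand-in for the frozen backbone, and replaces the lottery-ticket procedure with one-shot magnitude pruning. All names (`equilibrium_layer`, `magnitude_prune`, `frozen_backbone`, `lion_forward`) and all sizes and scales are hypothetical.

```python
import math
import random

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def equilibrium_layer(W, U, x, tol=1e-8, max_iter=500):
    """Return an approximate fixed point z of z = tanh(W z + U x).

    The iteration converges when every row of |W| sums to less than 1,
    because tanh is 1-Lipschitz (assumed parametrization, not the paper's).
    """
    ux = matvec(U, x)
    z = [0.0] * len(W)
    for _ in range(max_iter):
        z_next = [math.tanh(a + b) for a, b in zip(matvec(W, z), ux)]
        if max(abs(a - b) for a, b in zip(z_next, z)) < tol:
            return z_next
        z = z_next
    return z

def magnitude_prune(W, keep_ratio):
    """Zero all but the largest-magnitude entries of W.

    One-shot magnitude pruning, used here as a simplified stand-in for a
    lottery-ticket style procedure.
    """
    flat = sorted((abs(w) for row in W for w in row), reverse=True)
    k = max(1, int(len(flat) * keep_ratio))
    threshold = flat[k - 1]
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in W]

def frozen_backbone(h):
    """Toy stand-in for the frozen pre-trained backbone (hypothetical)."""
    return [2.0 * v - 0.5 for v in h]

def make_matrix(rng, rows, cols, scale):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def lion_forward(x, params):
    """Implicit layer -> frozen backbone -> implicit layer."""
    h = equilibrium_layer(params["W_in"], params["U_in"], x)
    h = frozen_backbone(h)
    return equilibrium_layer(params["W_out"], params["U_out"], h)

rng = random.Random(0)
dim = 4
# Scale 0.2 with dim 4 keeps each row of |W| below 0.8, so the iteration contracts.
params = {
    "W_in": make_matrix(rng, dim, dim, 0.2),
    "U_in": make_matrix(rng, dim, dim, 0.5),
    "W_out": make_matrix(rng, dim, dim, 0.2),
    "U_out": make_matrix(rng, dim, dim, 0.5),
}
# Only these four small matrices would be trained; prune half of each.
params = {name: magnitude_prune(W, keep_ratio=0.5) for name, W in params.items()}

x = [0.3, -0.1, 0.8, 0.5]
y = lion_forward(x, params)
```

In this sketch the backbone has no entry in `params`, which mirrors the idea that only the two small implicit layers carry trainable (and prunable) weights. Pruning only shrinks the row sums of |W|, so the contraction condition still holds afterwards.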