Despite their recent competitive performance across a range of vision tasks, vision Transformers still suffer from heavy computational costs. Recently, vision prompt learning has provided an economical solution to this problem without fine-tuning the whole large-scale model. However, the efficiency of existing models is still far from satisfactory due to the insertion of extensive prompt blocks and intricate prompt designs. In this paper, we propose an efficient vision model named impLicit vIsion prOmpt tuNing (LION), which is motivated by deep implicit models with stable memory costs for various complex tasks. In particular, we merely insert two equilibrium implicit layers at the two ends of the pre-trained main backbone, with the parameters of the backbone frozen. Moreover, we prune the parameters in these two layers according to the lottery ticket hypothesis. The performance obtained by our LION is promising on a wide range of datasets. In particular, our LION reduces the number of training parameters by up to 11.5% while obtaining higher performance than the state-of-the-art baseline VPT, especially under challenging scenes. Furthermore, we find that our proposed LION has good generalization performance, making it an easy way to boost transfer learning in the future.
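As a concrete illustration of the setup described above, the minimal PyTorch sketch below wraps a frozen backbone with two equilibrium implicit layers and applies lottery-ticket-style pruning to them. It is a hedged reading of the abstract, not the paper's implementation: the class names (`ImplicitLayer`, `LIONLike`), the naive unrolled fixed-point solver, the mean-pooled head, and the choice of L1 magnitude pruning with a 50% ratio are all illustrative assumptions.

```python
# Minimal sketch of a LION-style setup. Assumptions (not from the paper):
# the unrolled fixed-point solver, class names, and pruning ratio are
# illustrative choices only.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune


class ImplicitLayer(nn.Module):
    """DEQ-style equilibrium layer: approximately solves z* = f(z*, x).

    A real deep-equilibrium layer would use a root-finding solver and
    implicit differentiation; a few unrolled iterations are used here
    purely for readability.
    """

    def __init__(self, dim: int, n_iters: int = 10):
        super().__init__()
        self.linear_z = nn.Linear(dim, dim)
        self.linear_x = nn.Linear(dim, dim)
        self.n_iters = n_iters

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = torch.zeros_like(x)
        for _ in range(self.n_iters):  # crude fixed-point iteration
            z = torch.tanh(self.linear_z(z) + self.linear_x(x))
        return z


class LIONLike(nn.Module):
    """Frozen pre-trained backbone with one implicit layer at each end.

    Only the two implicit layers and the classification head receive
    gradients; the backbone parameters stay frozen throughout tuning.
    """

    def __init__(self, backbone: nn.Module, dim: int, num_classes: int):
        super().__init__()
        for p in backbone.parameters():
            p.requires_grad = False  # freeze the whole backbone
        self.front = ImplicitLayer(dim)   # implicit layer before the backbone
        self.backbone = backbone          # e.g. a ViT encoder over token embeddings
        self.back = ImplicitLayer(dim)    # implicit layer after the backbone
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) embeddings; the backbone is assumed
        # to map this shape to itself so the two implicit layers line up.
        z = self.front(tokens)
        z = self.backbone(z)
        z = self.back(z)
        return self.head(z.mean(dim=1))  # simple mean-pooled classification


def prune_implicit_layers(model: LIONLike, amount: float = 0.5) -> None:
    """Lottery-ticket-style sparsification of the two implicit layers.

    L1 magnitude pruning is one common stand-in for the lottery ticket
    procedure; the 50% ratio here is arbitrary.
    """
    for layer in (model.front, model.back):
        prune.l1_unstructured(layer.linear_z, name="weight", amount=amount)
        prune.l1_unstructured(layer.linear_x, name="weight", amount=amount)
```

In use, only the parameters with `requires_grad=True` (the two implicit layers and the head) would be handed to the optimizer, which is what keeps the trainable-parameter count so small relative to full fine-tuning.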