Despite their recent competitive performance across a range of vision tasks, vision Transformers still suffer from heavy computational costs. Recently, vision prompt learning has offered an economical solution to this problem that avoids fine-tuning the whole large-scale model. However, the efficiency of existing models is still far from satisfactory because they insert extensive prompt blocks and rely on tricky prompt designs. In this paper, we propose an efficient vision model named impLicit vIsion prOmpt tuNing (LION), which is motivated by deep implicit models with stable memory costs for various complex tasks. In particular, we merely insert two equilibrium implicit layers at the two ends of the pre-trained main backbone, with the backbone parameters frozen. Moreover, we prune the parameters in these two layers according to the lottery ticket hypothesis. The performance obtained by LION is promising on a wide range of datasets. In particular, LION reduces the number of training parameters by up to 11.5% while obtaining higher performance than the state-of-the-art baseline VPT, especially in challenging scenes. Furthermore, we find that LION has good generalization performance, making it an easy way to boost transfer learning in the future.
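The architecture described in the abstract (an implicit layer at each end of a frozen backbone, with the implicit layers pruned) can be illustrated with a minimal sketch. Everything here is an assumption for illustration and is not taken from the paper: the layer form z = tanh(W z + x), the plain fixed-point solver, the `frozen_backbone` stand-in, and one-shot magnitude masking as a simplified proxy for lottery-ticket-style pruning. All function names are hypothetical.

```python
import math
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def equilibrium_layer(W, x, tol=1e-8, max_iter=500):
    """Solve z = tanh(W z + x) by plain fixed-point iteration (assumed layer form).

    Only the current iterate is stored, so memory use does not grow with
    the number of solver iterations.
    """
    z = [0.0] * len(x)
    for _ in range(max_iter):
        z_next = [math.tanh(a + b) for a, b in zip(matvec(W, z), x)]
        if max(abs(a - b) for a, b in zip(z_next, z)) < tol:
            return z_next
        z = z_next
    return z


def magnitude_mask(W, sparsity):
    """Return a 0/1 mask that zeroes the smallest-magnitude fraction of W."""
    flat = sorted(abs(w) for row in W for w in row)
    k = int(sparsity * len(flat))
    threshold = flat[k - 1] if k > 0 else -1.0
    return [[0.0 if abs(w) <= threshold else 1.0 for w in row] for row in W]


def apply_mask(W, mask):
    """Zero out the pruned entries of W."""
    return [[w * m for w, m in zip(rw, rm)] for rw, rm in zip(W, mask)]


def small_random_matrix(n, rng, scale=0.4):
    """Random n x n matrix whose rows have absolute sum <= scale < 1.

    This makes z -> tanh(W z + x) a contraction, so the iteration converges.
    """
    return [[rng.uniform(-scale, scale) / n for _ in range(n)] for _ in range(n)]


def frozen_backbone(v):
    """Hypothetical stand-in for the frozen pre-trained backbone."""
    return [2.0 * x - 0.5 for x in v]


def forward(x, W_in, W_out):
    h = equilibrium_layer(W_in, x)      # implicit layer before the backbone
    h = frozen_backbone(h)              # backbone parameters are never updated
    return equilibrium_layer(W_out, h)  # implicit layer after the backbone


rng = random.Random(0)
n = 4
W_in = small_random_matrix(n, rng)
W_out = small_random_matrix(n, rng)

# Prune half of the entries in each implicit layer (illustrative sparsity).
W_in = apply_mask(W_in, magnitude_mask(W_in, sparsity=0.5))
W_out = apply_mask(W_out, magnitude_mask(W_out, sparsity=0.5))

y = forward([0.1, -0.2, 0.3, 0.0], W_in, W_out)
print(y)
```

In this sketch only `W_in` and `W_out` would be trained, which is what keeps the tunable parameter count small relative to the backbone. A real implementation would typically use a faster root solver and implicit differentiation for the backward pass rather than the naive iteration shown here.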