Existing adaptation techniques typically require architectural modifications or added parameters, leading to high computational costs and complexity. We introduce Attention Projection Layer Adaptation (APLA), a simple approach to adapt vision transformers (ViTs) without altering the architecture or adding parameters. Through a systematic analysis, we find that the layer immediately after the attention mechanism is crucial for adaptation. By updating only this projection layer, or even just a random subset of this layer's weights, APLA achieves state-of-the-art performance while reducing GPU memory usage by up to 52.63% and training time by up to 43.0%, with no extra cost at inference. Across 46 datasets covering a variety of domains including scene classification, medical imaging, satellite imaging, and fine-grained classification, APLA consistently outperforms 17 other leading adaptation methods, including full fine-tuning, on classification, segmentation, and detection tasks. The code is available at https://github.com/MoeinSorkhei/APLA.
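The core idea above, freezing the backbone and updating only the projection layer that follows attention, can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' implementation: the toy `Block` module and the `attn.proj` naming are assumptions modeled on timm-style ViT code, where the attention output projection is exposed under that name.

```python
import torch
from torch import nn

# Toy stand-in for a ViT encoder block (assumption: the attention output
# projection is registered as `attn.proj`, as in timm-style ViTs).
class Block(nn.Module):
    def __init__(self, dim: int = 8):
        super().__init__()
        self.attn = nn.ModuleDict({
            "qkv": nn.Linear(dim, 3 * dim),
            "proj": nn.Linear(dim, dim),  # the layer right after attention
        })
        self.mlp = nn.Linear(dim, dim)

model = nn.Sequential(*[Block() for _ in range(2)])

# APLA-style adaptation: freeze every parameter, then unfreeze only the
# attention projection layers. (The paper's variant of updating a random
# subset of this layer's weights would additionally mask its gradients.)
for p in model.parameters():
    p.requires_grad = False
for name, p in model.named_parameters():
    if "attn.proj" in name:
        p.requires_grad = True

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```

Because only the projection weights receive gradients, the optimizer state and activation storage for the frozen layers can be skipped, which is where the reported memory and training-time savings come from; inference is unchanged since no parameters or modules are added.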