Existing unsupervised domain adaptation (UDA) pipelines fine-tune an already well-trained backbone for every new source-target pair, so the number of trainable parameters and the storage footprint grow linearly with each new pair, and the well-trained backbone parameters cannot be reused. Inspired by recent findings that existing backbones carry textural biases, we propose VirDA, which exploits domain-specific textural bias for domain adaptation via visual reprogramming. Instead of fine-tuning the full backbone, VirDA prepends a domain-specific visual reprogramming layer to the backbone. This layer produces visual prompts that act as an added textural bias on the input image, adapting its "style" to the target domain. To optimize these visual reprogramming layers, we use multiple objective functions that account for intra- and inter-domain distribution differences once the domain-adapting visual prompts are applied. The backbone parameters are never modified, so the same backbone can be reused across different domains. We evaluate VirDA on Office-31 and obtain 92.8% mean accuracy with only 1.5M trainable parameters. VirDA surpasses PDA, the state-of-the-art parameter-efficient UDA baseline, by +1.6% accuracy while using just 46% of its parameters. Compared with full-backbone fine-tuning, VirDA outperforms CDTrans and FixBi by +0.2% and +1.4%, respectively, while requiring only 1.7% and 2.8% of their trainable parameters. Relative to the strongest current methods (PMTrans and TVT), VirDA uses ~1.7% of their parameters and trades off only 2.2% and 1.1% accuracy, respectively.
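To make the setup concrete, below is a minimal sketch of the frozen-backbone-plus-reprogramming idea. It assumes a PyTorch-style interface; the module names (`ReprogrammingLayer`, `VirDAWrapper`), the additive-prompt formulation, the ResNet-50 backbone, and the linear-MMD alignment loss are illustrative assumptions, not the paper's exact architecture or objectives.

```python
# Sketch (assumptions noted above): a frozen backbone is reused across domains;
# only the small per-domain reprogramming layers and the classifier head train.
import torch
import torch.nn as nn
import torchvision.models as models


class ReprogrammingLayer(nn.Module):
    """Hypothetical prompt generator: a shallow conv net whose output is
    added to the input image as an extra textural bias ("restyling")."""
    def __init__(self, channels: int = 3, hidden: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.net(x)  # reprogrammed image fed to the frozen backbone


class VirDAWrapper(nn.Module):
    """Frozen backbone shared by all domains; one reprogramming layer per domain."""
    def __init__(self, num_classes: int = 31):
        super().__init__()
        self.backbone = models.resnet50(weights="IMAGENET1K_V2")
        self.backbone.fc = nn.Identity()          # expose pooled 2048-d features
        for p in self.backbone.parameters():      # backbone stays untouched
            p.requires_grad = False
        self.reprog_src = ReprogrammingLayer()    # source-domain prompts
        self.reprog_tgt = ReprogrammingLayer()    # target-domain prompts
        self.classifier = nn.Linear(2048, num_classes)

    def forward(self, x: torch.Tensor, domain: str = "source"):
        layer = self.reprog_src if domain == "source" else self.reprog_tgt
        feats = self.backbone(layer(x))
        return feats, self.classifier(feats)


def mmd_loss(f_src: torch.Tensor, f_tgt: torch.Tensor) -> torch.Tensor:
    """Illustrative inter-domain objective: match feature means (linear MMD)."""
    return (f_src.mean(dim=0) - f_tgt.mean(dim=0)).pow(2).sum()


# One optimization step: gradients reach only the reprogramming layers and head.
model = VirDAWrapper()
opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=1e-3)
x_s, y_s = torch.randn(8, 3, 224, 224), torch.randint(0, 31, (8,))
x_t = torch.randn(8, 3, 224, 224)
f_s, logits_s = model(x_s, domain="source")
f_t, _ = model(x_t, domain="target")
loss = nn.functional.cross_entropy(logits_s, y_s) + mmd_loss(f_s, f_t)
loss.backward()
opt.step()
```

Because the backbone gradients are never applied, adapting to a new source-target pair only adds the lightweight reprogramming layers (and a head), which is where the linear-growth savings claimed in the abstract come from.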