State-of-the-art computer vision tasks, like monocular depth estimation (MDE), rely heavily on large, modern Transformer-based architectures. However, their application in safety-critical domains demands reliable predictive performance and uncertainty quantification. While Bayesian neural networks provide a conceptually simple approach to meet these requirements, they suffer from the high dimensionality of the parameter space. Parameter-efficient fine-tuning (PEFT) methods, in particular low-rank adaptations (LoRA), have emerged as a popular strategy for adapting large-scale models to downstream tasks by performing parameter inference on lower-dimensional subspaces. In this work, we investigate the suitability of PEFT methods for subspace Bayesian inference in large-scale Transformer-based vision models. We show that, indeed, combining BitFit, DiffFit, LoRA, and CoLoRA, a novel LoRA-inspired PEFT method, with Bayesian inference enables more robust and reliable predictive performance in MDE.
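As a minimal illustration of the low-rank subspace idea underlying LoRA, the sketch below shows how a frozen pretrained weight matrix is adapted through two small trainable factors, so that inference (Bayesian or otherwise) only needs to cover the low-dimensional factors. All names and sizes are hypothetical and not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r = 64, 64, 4  # hypothetical layer sizes and low rank r

W0 = rng.standard_normal((d_out, d_in))   # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, r))                   # trainable, zero-init so W == W0 at start

# Adapted weight: the update B @ A lives in a rank-r subspace,
# so only A and B need to be learned (or inferred) rather than W0.
W = W0 + B @ A

print(np.allclose(W, W0))         # True at initialization (B is zero)
print(A.size + B.size, W0.size)   # 512 trainable vs 4096 frozen parameters
```

Under this parameterization, placing a posterior over A and B alone reduces the dimensionality of Bayesian inference from d_out * d_in to r * (d_out + d_in).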