Deploying Vision-Language Models (VLMs) on edge devices is challenged by resource constraints and by performance degradation under distribution shifts. While test-time adaptation (TTA) can counteract such shifts, existing methods are too resource-intensive for on-device deployment. To address this challenge, we propose LQA, a lightweight, quantized-adaptive framework for VLMs that couples a modality-aware quantization strategy, Selective Hybrid Quantization (SHQ), with a quantized, gradient-free test-time adaptation mechanism, enabling robust and efficient VLM deployment on resource-constrained hardware. Experiments on seven open-source datasets spanning both synthetic and real-world distribution shifts show that LQA improves overall adaptation performance by 4.5\%, uses less memory than full-precision models, and significantly outperforms gradient-based TTA methods while requiring up to 19.9$\times$ less memory. These results demonstrate that LQA offers a practical pathway to robust, privacy-preserving, and efficient VLM deployment on edge devices.