Small Vision-Language Models (VLMs) provide a computationally efficient alternative to larger models, at the cost of weaker generalization abilities and downstream task performance. These shortcomings could be addressed by test-time scaling techniques, but existing methods are typically computationally demanding, contradicting the resource-efficient design goals of small models. To address these limitations, we propose two novel and efficient test-time scaling strategies that leverage model-internal features rather than external supervision: (i) Test-Time Augmentation (TTAug), which generates multiple augmented inputs and aggregates outputs at the token level without parameter updates, and (ii) Test-Time Adaptation (TTAdapt), which adapts model parameters during inference using consensus-based pseudolabels from TTAug. Through extensive experiments across nine benchmarks, we demonstrate consistent performance improvements while maintaining computational efficiency suitable for resource-constrained environments. The generality of our approach is demonstrated both within models at different scales and across different VLMs without additional tuning.
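The token-level aggregation in TTAug can be illustrated with a minimal sketch. The abstract does not specify the exact aggregation rule, so we assume here that per-token logits from the augmented views are averaged before greedy decoding; the array shapes and the `ttaug_aggregate` helper are illustrative, not the paper's implementation.

```python
import numpy as np

def ttaug_aggregate(logits_per_view: np.ndarray) -> np.ndarray:
    """Token-level output aggregation for Test-Time Augmentation (TTAug).

    Assumption: consensus is formed by averaging per-token logits across
    augmented views, then greedily decoding each position. No model
    parameters are updated, matching the training-free setting.

    logits_per_view: (n_views, seq_len, vocab_size) logits, one slice
                     per augmented input.
    returns: (seq_len,) aggregated token ids.
    """
    mean_logits = logits_per_view.mean(axis=0)  # consensus over views
    return mean_logits.argmax(axis=-1)          # greedy decode per token

# Toy demo: 3 augmented views, 4 token positions, vocabulary of 5
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4, 5))
tokens = ttaug_aggregate(logits)
print(tokens.shape)  # (4,)
```

In TTAdapt, the aggregated tokens from such a consensus would then serve as pseudolabels for adapting the model during inference.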