Vision-language models (VLMs) have revolutionized machine learning by leveraging large pre-trained models to tackle various downstream tasks. Despite improvements in label, training, and data efficiency, many state-of-the-art VLMs still require task-specific hyperparameter tuning and fail to fully exploit test samples. To overcome these challenges, we propose a graph-based approach for label-efficient adaptation and inference. Our method dynamically constructs a graph over text prompts, few-shot examples, and test samples, using label propagation for inference without task-specific tuning. Unlike existing zero-shot label propagation techniques, our approach requires no additional unlabeled support set and effectively leverages the test sample manifold through dynamic graph expansion. We further introduce a context-aware feature re-weighting mechanism to improve task adaptation accuracy. Additionally, our method supports efficient graph expansion, enabling real-time inductive inference. Extensive evaluations on downstream tasks, such as fine-grained categorization and out-of-distribution generalization, demonstrate the effectiveness of our approach.
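The core inference mechanism described above is label propagation over a similarity graph whose nodes are text-prompt embeddings, few-shot examples, and test samples. The sketch below is a minimal, generic illustration of that idea (in the classic iterative form F ← αSF + (1−α)Y with a symmetrically normalized kNN affinity matrix), not the paper's actual algorithm: the function name, the kNN construction, and all hyperparameter values are illustrative assumptions, and the dynamic graph expansion and context-aware re-weighting components are omitted.

```python
import numpy as np

def label_propagation(features, labels, num_classes, alpha=0.9, k=3, iters=50):
    """Toy label propagation over a kNN cosine-similarity graph.

    features: (n, d) array of embeddings (prompts, few-shot, and test samples).
    labels:   length-n array; class index for labeled nodes, -1 for test samples.
    Returns predicted class indices for all n nodes.
    (Illustrative sketch only; names and defaults are assumptions.)
    """
    n = features.shape[0]
    # Cosine-similarity affinity matrix, no self-loops, negatives clipped.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    W = np.clip(f @ f.T, 0.0, None)
    np.fill_diagonal(W, 0.0)
    # Sparsify: keep only the k strongest edges per node, then symmetrize.
    drop = np.argsort(W, axis=1)[:, :-k]
    np.put_along_axis(W, drop, 0.0, axis=1)
    W = np.maximum(W, W.T)
    # Symmetric normalization S = D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    d[d == 0] = 1.0
    dinv = 1.0 / np.sqrt(d)
    S = W * dinv[:, None] * dinv[None, :]
    # One-hot seed matrix Y; unlabeled (test) rows start at zero.
    Y = np.zeros((n, num_classes))
    for i, y in enumerate(labels):
        if y >= 0:
            Y[i, y] = 1.0
    # Iterate F <- alpha * S @ F + (1 - alpha) * Y to spread label mass
    # from labeled nodes to test samples along the graph.
    F = Y.copy()
    for _ in range(iters):
        F = alpha * (S @ F) + (1 - alpha) * Y
    return F.argmax(axis=1)
```

In this formulation, test samples receive labels from nearby labeled nodes through the graph rather than through any task-specific tuning, which is the property the abstract highlights; extending the graph with new test samples at inference time corresponds to the dynamic expansion the method proposes.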