A vision foundation model outputs an embedding vector for an image, and that vector can be changed by common editing operations (e.g., JPEG compression, brightness and contrast adjustments). Such common perturbations may therefore affect the performance of downstream tasks that use these embeddings. In this work, we present the first systematic study of foundation models' robustness to such perturbations. We propose three robustness metrics and formulate five desired mathematical properties for these metrics, analyzing which properties each metric satisfies or violates. Using these metrics, we evaluate six industry-scale foundation models (OpenAI, Meta) across nine common perturbation categories and find them generally non-robust. We also show that common perturbations degrade downstream application performance (e.g., classification accuracy) and that robustness values can predict the performance impact. Finally, we propose a fine-tuning approach that improves robustness without sacrificing utility.
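To make the setup concrete, the following is a minimal sketch of how one might measure the embedding shift caused by a single perturbation. It is illustrative only: the abstract does not define the three proposed metrics, so the mean cosine similarity used here is an assumed stand-in and not one of them. The embedder `make_toy_embedder` is a hypothetical placeholder (a fixed random projection followed by `tanh`) rather than any real foundation model, and the images are random pixel lists.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def adjust_brightness(pixels, factor):
    """Scale pixel intensities in [0, 1] by `factor`, clipping to [0, 1]."""
    return [min(1.0, max(0.0, p * factor)) for p in pixels]


def make_toy_embedder(num_pixels, dim, seed=0):
    """Hypothetical stand-in for a foundation model (assumption, not a real model):
    a fixed random linear projection followed by tanh."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(num_pixels)
    weights = [[rng.gauss(0.0, scale) for _ in range(num_pixels)] for _ in range(dim)]

    def embed(pixels):
        return [math.tanh(sum(w * p for w, p in zip(row, pixels))) for row in weights]

    return embed


def mean_embedding_similarity(embed, images, perturb):
    """Average cosine similarity between clean and perturbed embeddings.
    Assumed illustrative score; higher means the embedding moved less."""
    sims = [cosine_similarity(embed(img), embed(perturb(img))) for img in images]
    return sum(sims) / len(sims)


# Synthetic "images": 20 flat lists of 64 pixel intensities in [0, 1).
rng = random.Random(1)
images = [[rng.random() for _ in range(64)] for _ in range(20)]
embed = make_toy_embedder(num_pixels=64, dim=16)

identity_score = mean_embedding_similarity(
    embed, images, lambda img: adjust_brightness(img, 1.0)
)
brighter_score = mean_embedding_similarity(
    embed, images, lambda img: adjust_brightness(img, 1.5)
)
print(f"no perturbation: {identity_score:.3f}")
print(f"brightness x1.5: {brighter_score:.3f}")
```

In a real evaluation, `embed` would wrap an actual model's image encoder and `perturb` would apply a genuine image operation such as JPEG re-encoding; the comparison loop itself would stay the same.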