To extend vision-language models (VLMs) from web images to sensor-mediated physical environments, we propose Multi-View Physical-prompt for Test-Time Adaptation (MVP), a forward-only framework that moves test-time adaptation (TTA) from tokens to photons by treating the camera exposure triangle (ISO, shutter speed, and aperture) as a set of physical prompts. At inference, MVP acquires a library of physical views per scene, selects the top-k sensor settings using a source-affinity score, evaluates each retained view under lightweight digital augmentations, filters the lowest-entropy subset of augmented views, and aggregates predictions with a zero-temperature softmax (i.e., hard voting). This selection-then-vote design is simple, calibration-friendly, and requires no gradients or model modifications. On ImageNet-ES and ImageNet-ES-Diverse, MVP consistently outperforms digital-only TTA on single auto-exposure captures, by up to 25.6 percentage points (pp), and delivers up to 3.4 pp of additional gain over pipelines that combine conventional sensor control with TTA. MVP remains effective under reduced parameter candidate sets that lower capture latency, demonstrating its practicality. These results support our main claim: beyond post-capture prompting, measurement-time control (selecting and combining real physical views) substantially improves robustness for VLMs.
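The selection-then-vote pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the source-affinity scores are taken as given (the abstract does not specify how they are computed), and `k` and the kept fraction of augmented views are hypothetical placeholders.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a logit vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    # Shannon entropy of a probability vector (natural log).
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def mvp_predict(view_logits, affinity_scores, k=3, keep_frac=0.5):
    """Selection-then-vote over physical views (illustrative sketch).

    view_logits:     (n_views, n_augs, n_classes) frozen-VLM logits,
                     one row per captured sensor setting, one column per
                     lightweight digital augmentation of that capture.
    affinity_scores: (n_views,) precomputed source-affinity score per
                     capture (scoring method assumed, not specified here).
    """
    # 1. Retain the top-k physical views by source affinity.
    top = np.argsort(affinity_scores)[::-1][:k]
    probs = [softmax(l) for v in top for l in view_logits[v]]
    # 2. Keep only the lowest-entropy fraction of augmented views.
    ents = np.array([entropy(p) for p in probs])
    n_keep = max(1, int(keep_frac * len(probs)))
    kept = np.argsort(ents)[:n_keep]
    # 3. Zero-temperature softmax = hard vote: each kept view casts one
    #    vote for its argmax class; the majority class wins.
    votes = np.bincount([int(np.argmax(probs[i])) for i in kept],
                        minlength=view_logits.shape[-1])
    return int(np.argmax(votes))
```

Because step 3 discards the soft probabilities and counts only argmax votes, miscalibrated confidence in any single view cannot dominate the aggregate, which is the sense in which the design is calibration-friendly.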