From Tokens to Photons: Test-Time Physical Prompting for Vision-Language Models

To extend the application of vision-language models (VLMs) from web images to sensor-mediated physical environments, we propose Multi-View Physical-prompt for Test-Time Adaptation (MVP), a forward-only framework that moves test-time adaptation (TTA) from tokens to photons by treating the camera exposure triangle (ISO, shutter speed, and aperture) as physical prompts. At inference, MVP acquires a library of physical views per scene, selects the top-k sensor settings using a source-affinity score, evaluates each retained view under lightweight digital augmentations, keeps the lowest-entropy subset of augmented views, and aggregates predictions with zero-temperature softmax (i.e., hard voting). This selection-then-vote design is simple, calibration-friendly, and requires no gradients or model modifications. On ImageNet-ES and ImageNet-ES-Diverse, MVP consistently outperforms digital-only TTA applied to single Auto-Exposure captures, by up to 25.6 percentage points (pp), and delivers up to 3.4 pp additional gains over pipelines that combine conventional sensor control with TTA. MVP remains effective under reduced parameter candidate sets that lower capture latency, demonstrating practicality. These results support the main claim that, beyond post-capture prompting, measurement-time control (selecting and combining real physical views) substantially improves robustness for VLMs.
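The selection-then-vote procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the class probabilities for each augmented view and the source-affinity score for each sensor setting have already been computed elsewhere, and it assumes that the augmented views of all retained settings are pooled before entropy filtering (the abstract does not say whether filtering is pooled or per view). The names `mvp_vote`, `view_probs`, `affinity`, `k`, and `keep` are hypothetical.

```python
import math
from collections import Counter


def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def mvp_vote(view_probs, affinity, k=2, keep=3):
    """Selection-then-vote over precomputed class probabilities.

    view_probs: {setting: [prob_vector, ...]}, one vector per augmented view
    affinity:   {setting: score}; higher is assumed to mean closer to the source domain
    k:          number of sensor settings to retain
    keep:       number of lowest-entropy augmented views that get a vote
    """
    # 1) Keep the top-k sensor settings by source-affinity score.
    top = sorted(affinity, key=affinity.get, reverse=True)[:k]
    # 2) Pool the augmented views of the retained settings (assumption).
    pooled = [p for s in top for p in view_probs[s]]
    # 3) Keep the most confident (lowest-entropy) views.
    confident = sorted(pooled, key=entropy)[:keep]
    # 4) Hard voting: each kept view votes for its argmax class.
    votes = Counter(max(range(len(p)), key=p.__getitem__) for p in confident)
    return votes.most_common(1)[0][0]


# Toy example with three made-up sensor settings and three classes.
view_probs = {
    "auto":      [[0.40, 0.35, 0.25], [0.34, 0.33, 0.33]],
    "iso200_f4": [[0.90, 0.05, 0.05], [0.10, 0.80, 0.10]],
    "iso800_f8": [[0.05, 0.90, 0.05], [0.10, 0.85, 0.05]],
}
affinity = {"auto": 0.2, "iso200_f4": 0.7, "iso800_f8": 0.9}
print(mvp_vote(view_probs, affinity, k=2, keep=3))
```

Every step is a sort, a filter, or a count over forward-pass outputs, which is why such a pipeline needs no gradients or model changes.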

