We introduce and explore a new multimodal input representation for vision-language models: acoustic field video. Unlike conventional video (RGB frames with mono or stereo audio), our video stream provides a spatially grounded visualization of sound intensity across a scene, adding a powerful new dimension of perceptual understanding. Our real-time pipeline uses low-cost beamforming microphone arrays that are already common in smart speakers and increasingly present in robotics and XR headsets, yet this sensing capability remains largely untapped for scene understanding. To assess the value of spatial acoustic information, we constructed an evaluation set of 402 question-answer scenes and compared a state-of-the-art VLM's performance on conventional video with and without paired acoustic field video. Results show a clear and consistent improvement from incorporating spatial acoustic data: the VLM we test improves from 38.3% to 67.4% correct. Our findings highlight that many everyday scene-understanding tasks remain underconstrained when relying solely on visual and audio input, and that acoustic field data offers a promising and practical direction for multimodal reasoning. A video demo is available at https://daehwakim.com/seeingsound
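The spatial sound-intensity maps the abstract describes can be produced with standard array processing. The sketch below is an illustrative assumption, not the authors' actual pipeline: a minimal frequency-domain delay-and-sum beamformer that steers a small microphone array across a grid of azimuths and returns normalized output power per direction, which could then be rendered as a heatmap over the camera frame. The function name, array geometry, and grid are all hypothetical.

```python
import numpy as np

def acoustic_intensity_map(signals, mic_xy, fs, grid_az, c=343.0):
    """Delay-and-sum beamformer (illustrative sketch, not the paper's method).

    signals : (M, T) array of time-aligned microphone samples
    mic_xy  : (M, 2) microphone positions in meters
    fs      : sampling rate in Hz
    grid_az : 1-D array of candidate azimuths in radians
    Returns normalized beamformed power for each azimuth in grid_az.
    """
    M, T = signals.shape
    freqs = np.fft.rfftfreq(T, d=1.0 / fs)           # (F,) FFT bin frequencies
    X = np.fft.rfft(signals, axis=1)                 # (M, F) per-mic spectra
    powers = np.empty(len(grid_az))
    for i, az in enumerate(grid_az):
        u = np.array([np.cos(az), np.sin(az)])       # unit vector toward candidate source
        delays = mic_xy @ u / c                      # (M,) per-mic delay in seconds
        # Phase-align each channel toward this direction, then sum coherently.
        steer = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
        beam = np.sum(X * steer, axis=0)             # beamformed spectrum
        powers[i] = np.sum(np.abs(beam) ** 2)        # broadband output power
    return powers / powers.max()                     # normalize to [0, 1]
```

In a full pipeline, this 1-D azimuth sweep would be extended to a 2-D direction grid aligned with the camera's field of view and re-run per video frame to yield an acoustic field video stream.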