Tremendous progress in visual scene generation now turns a single image into an explorable 3D world, yet immersion remains incomplete without sound. We introduce Image2AVScene, the task of generating a 3D audio-visual scene from a single image, and present SonoWorld, the first framework to tackle this challenge. From one image, our pipeline outpaints a 360° panorama, lifts it into a navigable 3D scene, places language-guided sound anchors, and renders ambisonics for point, areal, and ambient sources, yielding spatial audio aligned with scene geometry and semantics. Quantitative evaluations on a newly curated real-world dataset and a controlled user study confirm the effectiveness of our approach. Beyond free-viewpoint audio-visual rendering, we also demonstrate applications to one-shot acoustic learning and audio-visual spatial source separation. Project website: https://humathe.github.io/sonoworld/
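To make the ambisonics rendering step concrete, the following is a minimal sketch of how a single point source could be encoded into first-order ambisonics for a listener at a given position. It is an illustration under stated assumptions, not SonoWorld's actual renderer: the function name `encode_foa`, the `min_dist` clamp, the inverse-distance gain, and the choice of ACN channel order (W, Y, Z, X) with SN3D normalization and an x-front, y-left, z-up coordinate frame are all assumptions made for this example. Areal and ambient sources are not covered.

```python
import math


def encode_foa(samples, source_pos, listener_pos, min_dist=1.0):
    """Encode a mono point source into first-order ambisonics.

    Assumed conventions (not taken from the paper): ACN channel order
    (W, Y, Z, X), SN3D normalization, x = front, y = left, z = up.
    Returns a list of four channels, each a list of samples.
    """
    # Direction vector from listener to source.
    dx = source_pos[0] - listener_pos[0]
    dy = source_pos[1] - listener_pos[1]
    dz = source_pos[2] - listener_pos[2]
    dist = math.sqrt(dx * dx + dy * dy + dz * dz)
    if dist == 0.0:
        raise ValueError("source and listener positions coincide")

    # Simple inverse-distance attenuation, clamped near the source
    # (a hypothetical choice for this sketch).
    gain = 1.0 / max(dist, min_dist)

    # First-order SN3D gains: W is omnidirectional; Y, Z, X are the
    # components of the unit direction vector.
    w = 1.0
    y = dy / dist
    z = dz / dist
    x = dx / dist

    return [[gain * g * s for s in samples] for g in (w, y, z, x)]
```

For example, `encode_foa([1.0, 0.5], (2.0, 0.0, 0.0), (0.0, 0.0, 0.0))` places the source two units directly in front of the listener, so the signal appears only in the W and X channels, each attenuated by a factor of 0.5. Moving the listener and re-encoding gives the free-viewpoint behavior described above in its simplest form.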