We introduce SeeingSounds, a lightweight and modular framework for audio-to-image generation that leverages the interplay between audio, language, and vision, without requiring any paired audio-visual data or training of visual generative models. Rather than treating audio as a substitute for text or relying solely on audio-to-text mappings, our method performs a dual alignment: audio is projected into a semantic language space via a frozen language encoder and contextually grounded in the visual domain using a vision-language model. This approach, inspired by cognitive neuroscience, reflects the natural cross-modal associations observed in human perception. The model operates on frozen diffusion backbones and trains only lightweight adapters, enabling efficient and scalable learning. Moreover, it supports fine-grained and interpretable control through procedural text prompt generation, where audio transformations (e.g., volume or pitch shifts) translate into descriptive prompts (e.g., "a distant thunder") that guide the visual output. Extensive experiments on standard benchmarks confirm that SeeingSounds outperforms existing methods in both zero-shot and supervised settings, establishing a new state of the art in controllable audio-to-visual generation.
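As a rough illustration of the procedural text prompt generation mentioned above, the sketch below maps assumed audio transformation parameters (a volume gain and a pitch shift) to descriptive prompt modifiers. The function name, thresholds, and modifier vocabulary are hypothetical and not taken from the SeeingSounds implementation; they only show how, under these assumptions, an attenuated recording of thunder could yield a prompt like "a distant thunder".

```python
# Hypothetical sketch: translate audio transformations into descriptive prompt text.
# All names and thresholds are illustrative, not part of the SeeingSounds code.

def procedural_prompt(base_caption: str,
                      volume_gain_db: float,
                      pitch_shift_semitones: float) -> str:
    """Compose a text prompt from a base audio caption and the applied transformations."""
    modifiers = []

    # Quieter audio -> spatially distant description; louder -> close and loud.
    if volume_gain_db <= -6.0:
        modifiers.append("distant")
    elif volume_gain_db >= 6.0:
        modifiers.append("close, loud")

    # Lower pitch -> heavier, deeper source; higher pitch -> smaller, lighter source.
    if pitch_shift_semitones <= -4.0:
        modifiers.append("deep and heavy")
    elif pitch_shift_semitones >= 4.0:
        modifiers.append("small and light")

    prefix = ", ".join(modifiers)
    return f"a {prefix} {base_caption}" if prefix else f"a {base_caption}"


if __name__ == "__main__":
    # An attenuated thunder clip maps to: "a distant thunder"
    print(procedural_prompt("thunder", volume_gain_db=-8.0, pitch_shift_semitones=0.0))
```

The resulting prompt would then condition the frozen diffusion backbone in place of, or alongside, the audio-derived embedding.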