Current multimodal LLMs process audio as a mono stream, ignoring the rich spatial information essential for embodied AI. Existing spatial audio models, conversely, are constrained to fixed microphone geometries, preventing deployment across diverse devices. We present PhaseCoder, a transformer-only spatial audio encoder that is agnostic to microphone geometry. PhaseCoder takes raw multichannel audio and microphone coordinates as input, performs localization, and produces robust spatial embeddings. We demonstrate that the Gemma 3n LLM can be fine-tuned to reason over "Spatial Audio Tokens" produced by PhaseCoder. Our encoder achieves state-of-the-art results on microphone-invariant localization benchmarks and, for the first time, enables an LLM to perform complex spatial reasoning and targeted transcription from an arbitrary microphone array.
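To make the described interface concrete, the following is a minimal, hypothetical sketch (not the paper's actual model) of a geometry-agnostic encoder: it accepts a waveform of shape (C, T) together with per-microphone coordinates of shape (C, 3), fuses each channel's features with its coordinates, and pools over channels so the output token shape is independent of the number and layout of microphones. The function name, feature choice, and random projection are all illustrative assumptions.

```python
import numpy as np

def spatial_tokens(audio, mic_xyz, d_model=8, n_tokens=4, seed=0):
    """Toy geometry-agnostic encoder sketch (hypothetical, not PhaseCoder).

    audio:   (C, T) raw multichannel waveform
    mic_xyz: (C, 3) microphone coordinates in meters
    Returns an (n_tokens, d_model) array of "spatial audio tokens".
    """
    C, T = audio.shape
    rng = np.random.default_rng(seed)  # stand-in for learned weights

    # Per-channel feature: framewise log-energy over n_tokens frames.
    frames = audio[:, : T - T % n_tokens].reshape(C, n_tokens, -1)
    energy = np.log1p((frames ** 2).mean(axis=-1))          # (C, n_tokens)

    # Condition each channel's feature on that microphone's coordinates.
    coords = np.broadcast_to(mic_xyz[:, None, :], (C, n_tokens, 3))
    feats = np.concatenate([energy[..., None], coords], axis=-1)  # (C, n_tokens, 4)

    # Project and mean-pool over channels: permutation invariance over
    # microphones is what makes the output geometry-agnostic.
    W = rng.standard_normal((4, d_model))
    return (feats @ W).mean(axis=0)                          # (n_tokens, d_model)
```

Because the channel axis is pooled away, a 2-microphone laptop and a 6-microphone smart speaker yield tokens of the same shape, which is the property that lets a single downstream LLM consume them.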