Emerging wearable devices such as smartglasses and extended-reality headsets demand high-quality spatial audio capture from compact, head-worn microphone arrays. Ambisonics provides a device-agnostic spatial audio representation by mapping array signals to spherical harmonic (SH) coefficients. In practice, however, accurate encoding remains challenging. Traditional linear encoders are signal-independent and robust, but they amplify low-frequency noise and suffer from high-frequency spatial aliasing. Neural network approaches, on the other hand, can outperform linear encoders, but they often assume idealized microphones and may perform inconsistently in real-world scenarios. To leverage their complementary strengths, we introduce a residual-learning framework that refines a linear encoder with corrections from a neural network. Using measured array transfer functions from smartglasses, we compare a UNet-based encoder from the literature with a new recurrent attention model. Our analysis reveals that both neural encoders consistently outperform the linear baseline only when integrated within the residual-learning framework. In the residual configuration, both neural models achieve consistent and significant improvements across all tested metrics for in-domain data and moderate gains for out-of-domain data. Yet coherence analysis indicates that all neural encoder configurations continue to struggle with directionally accurate high-frequency encoding.
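The residual-learning idea described above can be illustrated with a minimal sketch: the fixed linear encoder produces a robust SH estimate, and the neural network predicts only an additive correction. All names here (`linear_encode`, `neural_residual`, the matrix shapes, and the toy linear stand-in for the network) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Illustrative sketch of the residual encoding framework; names and
# dimensions are hypothetical, not taken from the paper.
rng = np.random.default_rng(0)

n_mics = 4    # microphones on the head-worn array
n_sh = 4      # first-order Ambisonics: (N+1)^2 = 4 SH channels
n_frames = 8  # time frames (a single frequency bin shown)

# Signal-independent linear encoder: a fixed matrix mapping microphone
# signals to SH coefficients, e.g. fitted to measured array transfer
# functions via regularized least squares.
E = rng.standard_normal((n_sh, n_mics))

def linear_encode(x):
    """Baseline: SH coefficients from the fixed encoding matrix."""
    return E @ x

def neural_residual(x):
    """Stand-in for the neural network (a toy linear layer here).
    In the paper this role is played by a UNet or a recurrent
    attention model."""
    W = 0.1 * rng.standard_normal((n_sh, n_mics))
    return W @ x

def residual_encode(x):
    """Residual framework: the network predicts only a correction to
    the robust linear estimate, rather than the full SH signal."""
    return linear_encode(x) + neural_residual(x)

x = rng.standard_normal((n_mics, n_frames))  # microphone array signals
y = residual_encode(x)
print(y.shape)  # → (4, 8)
```

Because the network output is added to the linear estimate, the encoder degrades gracefully toward the linear baseline when the correction is small, which matches the robustness motivation stated in the abstract.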