Individual Head-Related Transfer Functions (HRTFs) are crucial for realistic spatial audio rendering and are beginning to appear in commercial immersive audio applications. However, a main obstacle to their adoption is that measuring individual HRTFs is impractical at scale because the measurement process is complex. To mitigate this drawback, HRTF spatial upsampling has been proposed, with the aim of reducing the number of measurements required. While prior work has had success with various machine learning (ML) approaches, these models often struggle to keep local spatial variation patterns across neighbouring source directions consistent over long ranges, and to generalize at high upsampling factors. In this paper, we propose a novel transformer-based architecture for HRTF upsampling that leverages the attention mechanism to better capture spatial correlations across the HRTF sphere. Working in the spherical harmonic (SH) domain, our model learns to reconstruct high-resolution HRTFs from sparse input measurements with significantly improved accuracy. To enhance spatial coherence, we introduce a neighbour dissimilarity loss that promotes magnitude smoothness, yielding more realistic upsampled HRTFs. We evaluate our method using both perceptual localization models and objective spectral distortion metrics. Experiments show that our model outperforms existing methods across several evaluation metrics in generating realistic, high-fidelity HRTFs.
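As an illustration of the kind of objective spectral distortion metric mentioned above, the following is a minimal sketch of the commonly used log-spectral distortion (LSD) between a reference and a reconstructed HRTF magnitude spectrum. The abstract does not state which exact metric is used, so the function name `log_spectral_distortion`, the array layout (frequency bins on the last axis), and the `eps` stabiliser are assumptions made for this example only.

```python
import numpy as np


def log_spectral_distortion(h_true, h_pred, eps=1e-12):
    """Mean log-spectral distortion in dB (illustrative, not the paper's code).

    h_true, h_pred: arrays of HRTF magnitudes (or complex spectra) with the
    frequency bins on the last axis, e.g. (n_directions, n_ears, n_bins).
    eps: small constant (an assumption) that avoids log10(0).
    """
    mag_true = np.abs(np.asarray(h_true))
    mag_pred = np.abs(np.asarray(h_pred))

    # Per-bin magnitude error on a decibel scale.
    error_db = 20.0 * np.log10((mag_true + eps) / (mag_pred + eps))

    # Root-mean-square over frequency gives one LSD value per direction/ear.
    lsd_per_item = np.sqrt(np.mean(error_db ** 2, axis=-1))

    # Average over all remaining axes (directions, ears, subjects, ...).
    return float(np.mean(lsd_per_item))
```

A lower value indicates a reconstruction whose magnitude spectrum is closer to the measured one. Identical inputs give 0 dB, and a uniform factor-of-ten magnitude error gives 20 dB.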