Towards Environmental Preference Based Speech Enhancement For Individualised Multi-Modal Hearing Aids

Since the advent of Deep Learning (DL), Speech Enhancement (SE) models have performed well under a variety of noise conditions. However, such systems may still introduce sonic artefacts, sound unnatural, and restrict the user's ability to hear ambient sounds that may be important. Hearing Aid (HA) users may wish to customise their SE systems to suit their personal preferences and day-to-day lifestyle. In this paper, we introduce a preference learning based SE (PLSE) model for future multi-modal HAs that can contextually exploit audio information to improve listening comfort, based upon the preferences of the user. The proposed system estimates the signal-to-noise ratio (SNR) as a basic objective speech quality measure, which quantifies the relative amount of background noise present in speech and correlates directly with the intelligibility of the signal. Additionally, to provide contextual information, we predict the acoustic scene in which the user is situated. Both tasks are performed by a multi-task DL model which, by jointly leveraging a shared encoded feature space, outperforms models that infer the acoustic scene or the SNR separately. These environmental inferences are exploited in a preference elicitation framework, which linearly learns a set of predictive functions to determine the target SNR of an audio-visual (AV) SE system. By greatly reducing noise in challenging listening conditions, and by scaling the output of the SE model in a novel way, we are able to provide HA users with contextually individualised SE. Preliminary results suggest an improvement over the non-individualised baseline model for some participants.
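To make the idea of a preference-driven target SNR concrete, the following is a minimal sketch and not the implementation used in this work. It assumes that (a) the preferred SNR for each acoustic scene is a linear function of the estimated SNR, (b) the SE output can be treated as the speech estimate and the difference between the noisy input and the SE output as the noise estimate, and (c) the output is scaled by re-adding an attenuated copy of that noise. The function names `predict_target_snr` and `remix_to_target_snr`, and the weight and bias arrays they take, are hypothetical.

```python
import numpy as np


def predict_target_snr(est_snr_db, scene_index, weights, biases):
    """Hypothetical per-scene linear map from estimated SNR to preferred SNR (dB).

    `weights` and `biases` hold one learned coefficient pair per acoustic scene;
    any values passed in are illustrative, not results from the paper.
    """
    return weights[scene_index] * est_snr_db + biases[scene_index]


def remix_to_target_snr(enhanced, noisy, target_snr_db):
    """Re-add attenuated residual noise so the output approaches the target SNR.

    Assumption: `enhanced` is the speech estimate and `noisy - enhanced`
    is the noise estimate.
    """
    residual = noisy - enhanced
    speech_power = np.mean(enhanced ** 2)
    noise_power = np.mean(residual ** 2)
    if speech_power == 0.0 or noise_power == 0.0:
        # Nothing to balance: return the SE output unchanged.
        return enhanced.copy()
    # Solve speech_power / (gain^2 * noise_power) = 10^(target_snr_db / 10).
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (target_snr_db / 10.0)))
    # Never add back more noise than was present in the input.
    gain = min(gain, 1.0)
    return enhanced + gain * residual
```

Under these assumptions, a user who prefers to retain some ambient sound in a given scene would have a lower predicted target SNR for that scene, and so a larger share of the residual would be mixed back into the output.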
