Scene recognition is important for hearing devices; however, it is challenging, in part because of the limitations of existing datasets. Datasets often lack public accessibility, completeness, or audiologically relevant labels, hindering systematic comparison of machine learning models. Deploying such models on resource-constrained edge devices presents another challenge. The proposed solution is two-fold: a repackaging and refinement of several open-source datasets to create AHEAD-DS, a dataset designed for auditory scene recognition for hearing devices, and the introduction of OpenYAMNet, a sound recognition model. AHEAD-DS aims to provide a standardised, publicly available dataset with consistent labels relevant to hearing aids, facilitating model comparison. OpenYAMNet is designed for deployment on edge devices such as smartphones connected to hearing devices, including hearing aids and wireless earphones with hearing aid functionality, and serves as a baseline model for sound-based scene recognition. OpenYAMNet achieved a mean average precision of 0.86 and an accuracy of 0.93 on the AHEAD-DS testing set across fourteen categories relevant to auditory scene recognition. Real-time sound-based scene recognition on edge devices was demonstrated by deploying OpenYAMNet to an Android smartphone. Even on a 2018 Google Pixel 3, a phone with modest specifications, loading the model takes approximately 50 ms, and processing latency grows approximately linearly at 30 ms per second of audio. The project website, with links to code, data, and models: https://github.com/Australian-Future-Hearing-Initiative