Manual annotation of audio datasets is labour-intensive, and balancing label granularity with acoustic separability is challenging. We introduce AuditoryHuM, a novel framework for the unsupervised discovery and clustering of auditory scene labels using a collaborative Human-Multimodal Large Language Model (MLLM) approach. Leveraging MLLMs (Gemma and Qwen), the framework generates contextually relevant labels for audio data. To ensure label quality and mitigate hallucinations, we employ zero-shot alignment scoring (Human-CLAP) to quantify how well each generated text label matches the raw audio content. A targeted human-in-the-loop intervention then refines the least-aligned pairs. The discovered labels are grouped into thematically cohesive clusters using an adjusted silhouette score that incorporates a penalty parameter to balance cluster cohesion against thematic granularity. Evaluated across three diverse auditory scene datasets (ADVANCE, AHEAD-DS, and TAU 2019), AuditoryHuM provides a scalable, low-cost solution for creating standardised taxonomies. This solution facilitates the training of lightweight scene recognition models deployable to edge devices, such as hearing aids and smart home assistants. Project page and code: https://github.com/Australian-Future-Hearing-Initiative
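To make the alignment-scoring step concrete, the sketch below computes a cosine similarity between an MLLM-generated label and an audio clip with a CLAP-style model. It is a minimal illustration only: it assumes the public `laion/clap-htsat-unfused` checkpoint (via Hugging Face `transformers`) as a stand-in for Human-CLAP, whose actual weights and scoring details may differ.

```python
# Minimal sketch of text-audio alignment scoring with a CLAP-style model.
# Assumption: laion/clap-htsat-unfused stands in for Human-CLAP; the
# framework's actual model, preprocessing, and thresholds may differ.
import torch
import torch.nn.functional as F
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

def alignment_score(label: str, audio, sampling_rate: int = 48_000) -> float:
    """Cosine similarity between a generated label and a raw audio clip.

    `audio` is a 1-D float array of samples at `sampling_rate`.
    """
    text_inputs = processor(text=[label], return_tensors="pt", padding=True)
    audio_inputs = processor(audios=audio, sampling_rate=sampling_rate,
                             return_tensors="pt")
    with torch.no_grad():
        t = model.get_text_features(**text_inputs)
        a = model.get_audio_features(**audio_inputs)
    return F.cosine_similarity(t, a).item()

# Label-audio pairs with the lowest scores are the ones routed to the
# human-in-the-loop review described in the abstract.
```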
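Similarly, the adjusted silhouette criterion can be sketched as a standard silhouette score minus a penalty that grows with the number of clusters. The abstract does not specify the penalty's functional form, so the linear term `lam * k / k_max` below, along with the use of k-means over label embeddings, is a hypothetical illustration.

```python
# Hypothetical sketch of a penalised silhouette criterion for choosing the
# number of label clusters k. The exact penalty used by AuditoryHuM is not
# given in the abstract; lam * k / k_max is an illustrative choice.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def adjusted_silhouette(embeddings: np.ndarray, k: int, lam: float = 0.1,
                        k_max: int = 50, seed: int = 0) -> float:
    """Silhouette score of a k-means clustering, penalised by cluster count."""
    labels = KMeans(n_clusters=k, random_state=seed,
                    n_init=10).fit_predict(embeddings)
    return silhouette_score(embeddings, labels) - lam * k / k_max

# Choosing k to maximise this score trades off cohesion and granularity:
# a small lam favours many fine-grained clusters, a large lam favours
# fewer, broader scene themes.
```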