Misophonia is a disorder characterized by a decreased tolerance to specific everyday sounds (trigger sounds) that can evoke intense negative emotional responses such as anger, panic, or anxiety. These reactions can substantially impair daily functioning and quality of life. Assistive technologies that selectively detect trigger sounds could help reduce distress and improve well-being. In this study, we investigate sound event detection (SED) to localize intervals of trigger sounds in continuous environmental audio as a foundational step toward such assistive support. Motivated by the scarcity of real-world misophonia data, we generate synthetic soundscapes tailored to misophonia trigger sound detection using audio synthesis techniques. Then, we perform trigger sound detection tasks using hybrid CNN-based models. The models combine feature extraction using a frozen pre-trained CNN backbone with a trainable time-series module, such as a gated recurrent unit (GRU), long short-term memory (LSTM) network, echo state network (ESN), or one of their bidirectional variants. The detection performance is evaluated using common SED metrics, including Polyphonic Sound Detection Score 1 (PSDS1). On the multi-class trigger SED task, bidirectional temporal modeling consistently improves detection performance, with the Bidirectional GRU (BiGRU) achieving the best overall accuracy. Notably, the Bidirectional ESN (BiESN) attains competitive performance while requiring orders of magnitude fewer trainable parameters, since only the readout is optimized. We further simulate user personalization via a few-shot "eating sound" detection task with at most five support clips, in which BiGRU and BiESN are compared. In this strict adaptation setting, BiESN shows robust and stable performance, suggesting that lightweight temporal modules are promising for personalized misophonia trigger SED.
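To make the readout-only training idea behind the BiESN concrete, the following is a minimal NumPy sketch, not the paper's actual implementation: all dimensions, the reservoir scaling, and the ridge-regression readout are illustrative assumptions. The reservoir weights stay fixed, the sequence is processed forward and backward, and only a linear readout over the concatenated states is fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: D-dim frame features (e.g. from a frozen CNN
# backbone), N reservoir units, C trigger-sound classes, T frames per clip.
D, N, C, T = 64, 200, 5, 100

# Fixed random reservoir weights (never trained), rescaled so the spectral
# radius is below 1, a common sufficient condition for the echo state property.
W_in = rng.uniform(-0.1, 0.1, (N, D))
W = rng.normal(0.0, 1.0, (N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))

def reservoir_states(X):
    """Run the fixed reservoir over a (T, D) feature sequence."""
    h = np.zeros(N)
    states = []
    for x in X:
        h = np.tanh(W_in @ x + W @ h)
        states.append(h)
    return np.array(states)                       # (T, N)

def bidirectional_states(X):
    """Concatenate a forward pass and a time-reversed backward pass."""
    fwd = reservoir_states(X)
    bwd = reservoir_states(X[::-1])[::-1]
    return np.concatenate([fwd, bwd], axis=1)     # (T, 2N)

# Toy training clip: random features with random one-hot frame labels.
X_train = rng.normal(0.0, 1.0, (T, D))
Y_train = np.eye(C)[rng.integers(0, C, T)]        # (T, C)

# Train ONLY the linear readout, in closed form via ridge regression:
# these 2N*C weights are the model's only trainable parameters.
S = bidirectional_states(X_train)                 # (T, 2N)
lam = 1e-2
W_out = np.linalg.solve(S.T @ S + lam * np.eye(2 * N), S.T @ Y_train)

# Frame-wise class scores for a new clip.
scores = bidirectional_states(rng.normal(0.0, 1.0, (T, D))) @ W_out
print(scores.shape)                               # (T, C)
```

Because the readout has a closed-form solution, adapting to a new user's trigger class only requires refitting `W_out` on a few support clips, which is consistent with the few-shot setting described above.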