FAIR4Cov: Fused Audio Instance and Representation for COVID-19 Detection

Audio-based classification of body sounds has long been studied to support diagnostic decisions, particularly for pulmonary diseases. In response to the urgency of the COVID-19 pandemic, a growing number of models have been developed to identify COVID-19 patients from acoustic input. Most models focus on cough because dry cough is the best-known symptom of COVID-19. However, other body sounds, such as breath and speech, have also been shown to correlate with COVID-19. In this work, rather than relying on a specific body sound, we propose Fused Audio Instance and Representation for COVID-19 Detection (FAIR4Cov). It constructs a joint feature vector from multiple body sounds in both waveform and spectrogram representation. The core component of FAIR4Cov is a self-attention fusion unit that is trained to capture the relations among multiple body sounds and audio representations and to integrate them into a compact feature vector. We run experiments on different combinations of body sounds using waveform only, spectrogram only, and a joint representation of waveform and spectrogram. Our findings show that using self-attention to combine features extracted from cough, breath, and speech sounds leads to the best performance, with an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.8658, a sensitivity of 0.8057, and a specificity of 0.7958. This AUC is 0.0227 higher than that of the models trained on spectrograms only and 0.0847 higher than that of the models trained on waveforms only. The results demonstrate that combining the spectrogram and waveform representations enriches the extracted features and outperforms models that use a single representation.
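To make the fusion idea concrete, the following is a minimal sketch of single-head scaled dot-product self-attention applied to six feature vectors, one per combination of body sound (cough, breath, speech) and representation (waveform, spectrogram). It is an illustration under stated assumptions, not the FAIR4Cov implementation: the feature dimension, the attention dimension, the randomly initialized projection matrices, the random stand-in features, the mean pooling, and the function name `self_attention_fusion` are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_attention_fusion(tokens, w_q, w_k, w_v):
    """Fuse a set of feature vectors with single-head self-attention.

    tokens: (n_tokens, feature_dim), one row per (sound, representation) pair.
    Returns the fused vector (attn_dim,) and the attention weights
    (n_tokens, n_tokens).
    """
    q = tokens @ w_q                      # queries
    k = tokens @ w_k                      # keys
    v = tokens @ w_v                      # values
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)    # each row sums to 1
    attended = weights @ v                # every token attends to all tokens
    fused = attended.mean(axis=0)         # assumed pooling: mean over tokens
    return fused, weights


rng = np.random.default_rng(0)
feature_dim, attn_dim = 128, 64           # hypothetical sizes
sounds = ("cough", "breath", "speech")
representations = ("waveform", "spectrogram")

# Stand-ins for the outputs of per-input feature extractors (random here).
features = {
    (s, r): rng.standard_normal(feature_dim)
    for s in sounds
    for r in representations
}
tokens = np.stack([features[(s, r)] for s in sounds for r in representations])

# Projection matrices would be learned in practice; random for illustration.
w_q, w_k, w_v = (
    rng.standard_normal((feature_dim, attn_dim)) / np.sqrt(feature_dim)
    for _ in range(3)
)

fused, weights = self_attention_fusion(tokens, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

In this sketch, each row of `weights` indicates how strongly one sound and representation pair attends to the others, and `fused` plays the role of the compact joint feature vector that a downstream classifier would consume.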
