Speech-based machine learning systems are sensitive to noise, complicating reliable deployment in emotion recognition and voice pathology detection. We evaluate the robustness of a hybrid quantum machine learning model, quanvolutional neural networks (QNNs), against classical convolutional neural networks (CNNs) under four acoustic corruptions (Gaussian noise, pitch shift, temporal shift, and speed variation) in a clean-train/corrupted-test regime. Using AVFAD (voice pathology) and TESS (speech emotion), we compare three QNN models (Random, Basic, Strongly) to a simple CNN baseline (CNN-Base), ResNet-18, and VGG-16 using accuracy and corruption metrics (CE, mCE, RCE, RmCE), and analyze architectural factors (circuit complexity or depth, convergence) alongside per-emotion robustness. QNNs generally outperform CNN-Base under pitch shift, temporal shift, and speed variation (up to 22% lower CE/RCE at severe temporal shift), while CNN-Base remains more resilient to Gaussian noise. Among quantum circuits, QNN-Basic achieves the best overall robustness on AVFAD, and QNN-Random performs best on TESS. Across emotions, fear is the most robust (80-90% accuracy under severe corruptions), neutral can collapse under strong Gaussian noise (5.5% accuracy), and happy is the most vulnerable to pitch, temporal, and speed distortions. QNNs also converge up to six times faster than CNN-Base. To our knowledge, this is the first systematic study of QNN robustness for speech under common non-adversarial acoustic corruptions, indicating that shallow entangling quantum front-ends can improve noise resilience while sensitivity to additive noise remains a challenge.
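The abstract names the corruption metrics (CE, mCE, RCE, RmCE) without defining them. The sketch below shows one common way such metrics are computed, assuming the convention popularized by the ImageNet-C benchmark: error rates are summed over severity levels and normalized by a baseline model, taken here to be CNN-Base. That convention, the function names, and all numbers are illustrative assumptions, not the authors' code or results.

```python
from statistics import mean


def corruption_error(model_err, base_err):
    """CE for one corruption type.

    model_err / base_err: error rates (1 - accuracy) of the evaluated model
    and of the baseline, one entry per severity level.
    Assumed convention: sum over severities, normalize by the baseline.
    """
    return sum(model_err) / sum(base_err)


def relative_corruption_error(model_err, model_clean, base_err, base_clean):
    """RCE for one corruption type.

    Same as CE, but each error is first reduced by the model's own
    clean-test error, so only the degradation caused by corruption counts.
    """
    model_drop = sum(e - model_clean for e in model_err)
    base_drop = sum(e - base_clean for e in base_err)
    return model_drop / base_drop


def mean_over_corruptions(per_corruption):
    """mCE (or RmCE): average of per-corruption CE (or RCE) values."""
    return mean(per_corruption.values())


if __name__ == "__main__":
    # Hypothetical error rates at two severity levels (placeholders only).
    qnn = {"pitch_shift": [0.20, 0.40], "gaussian_noise": [0.50, 0.70]}
    cnn = {"pitch_shift": [0.40, 0.80], "gaussian_noise": [0.30, 0.50]}

    ce = {name: corruption_error(qnn[name], cnn[name]) for name in qnn}
    print("CE per corruption:", ce)
    print("mCE:", mean_over_corruptions(ce))
```

Under this convention a value below 1 means the evaluated model degrades less than the baseline on that corruption, and a value above 1 means it degrades more.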