Quantifying Quanvolutional Neural Networks Robustness for Speech in Healthcare Applications

Speech-based machine learning systems are sensitive to noise, which complicates reliable deployment in emotion recognition and voice pathology detection. We evaluate the robustness of a hybrid quantum machine learning model, the quanvolutional neural network (QNN), against classical convolutional neural networks (CNNs) under four acoustic corruptions (Gaussian noise, pitch shift, temporal shift, and speed variation) in a clean-train/corrupted-test regime. Using the AVFAD (voice pathology) and TESS (speech emotion) datasets, we compare three QNN models (Random, Basic, Strongly) with a simple CNN baseline (CNN-Base), ResNet-18, and VGG-16. We report accuracy and corruption metrics (CE, mCE, RCE, RmCE), and analyze architectural factors (circuit complexity or depth, convergence) alongside per-emotion robustness. QNNs generally outperform CNN-Base under pitch shift, temporal shift, and speed variation (up to 22% lower CE/RCE at severe temporal shift), while CNN-Base remains more resilient to Gaussian noise. Among the quantum circuits, QNN-Basic achieves the best overall robustness on AVFAD, and QNN-Random performs strongest on TESS. Emotion-wise, fear is the most robust (80-90% accuracy under severe corruptions), neutral can collapse under strong Gaussian noise (5.5% accuracy), and happy is the most vulnerable to pitch, temporal, and speed distortions. QNNs also converge up to six times faster than CNN-Base. To our knowledge, this is the first systematic study of QNN robustness for speech under common non-adversarial acoustic corruptions. The results indicate that shallow entangling quantum front-ends can improve noise resilience, while sensitivity to additive noise remains a challenge.
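The abstract names the corruption metrics (CE, mCE, RCE, RmCE) without defining them. The sketch below shows one common way such metrics are computed, assuming the baseline-normalized definitions popularized by image-corruption benchmarks: CE sums a model's error rates over severity levels and divides by the same sum for a reference model, RCE does the same after subtracting each model's clean error, and mCE/RmCE average over corruption types. These definitions, the function names, and all numbers are assumptions for illustration only; the paper's exact formulation may differ, and the values are not results from the study.

```python
# Hypothetical sketch of baseline-normalized corruption metrics.
# Assumed definitions (not confirmed by the abstract):
#   CE_c  = sum_s E_model(s, c) / sum_s E_base(s, c)
#   RCE_c = sum_s (E_model(s, c) - E_model_clean)
#           / sum_s (E_base(s, c) - E_base_clean)
#   mCE / RmCE = mean of CE_c / RCE_c over corruption types c


def corruption_error(model_errs, base_errs):
    """CE for one corruption type; inputs are error rates per severity level."""
    return sum(model_errs) / sum(base_errs)


def relative_corruption_error(model_errs, model_clean, base_errs, base_clean):
    """RCE for one corruption type: degradation relative to clean error."""
    model_drop = sum(e - model_clean for e in model_errs)
    base_drop = sum(e - base_clean for e in base_errs)
    return model_drop / base_drop


def mean_over_corruptions(per_corruption):
    """mCE or RmCE: average a {corruption_name: value} mapping."""
    return sum(per_corruption.values()) / len(per_corruption)


# Made-up error rates at three severity levels (illustrative only).
qnn_errs = {"pitch_shift": [0.10, 0.20, 0.30], "gaussian_noise": [0.30, 0.50, 0.70]}
cnn_errs = {"pitch_shift": [0.20, 0.40, 0.60], "gaussian_noise": [0.20, 0.30, 0.50]}
qnn_clean, cnn_clean = 0.05, 0.10

ce = {c: corruption_error(qnn_errs[c], cnn_errs[c]) for c in qnn_errs}
rce = {
    c: relative_corruption_error(qnn_errs[c], qnn_clean, cnn_errs[c], cnn_clean)
    for c in qnn_errs
}
mce = mean_over_corruptions(ce)
rmce = mean_over_corruptions(rce)
```

Under these assumed definitions, a value below 1 means the evaluated model degrades less than the reference model on that corruption, and a value above 1 means it degrades more.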