Adversarial attacks pose a security threat to machine-learning-based automatic speech recognition (ASR) systems. To defend against such attacks, we propose an adversarial example detection strategy applicable to any ASR system that predicts a probability distribution over output tokens at each time step. We measure a set of characteristics of this distribution: the median, maximum, and minimum of the output probabilities, the entropy, and the Jensen-Shannon divergence between the distributions of subsequent time steps. We then fit a Gaussian distribution to the characteristics observed for benign data. By computing the likelihood of new incoming audio under this distribution, we can distinguish malicious inputs from clean samples with an area under the receiver operating characteristic curve (AUROC) higher than 0.99, which drops to 0.98 for lower-quality audio. To assess the robustness of our method, we build adaptive attacks. These reduce the AUROC to 0.96 but result in noisier adversarial clips.
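The detection pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`utterance_features`, `fit_gaussian`, `log_likelihood`) are hypothetical, averaging each characteristic over time steps is an assumption, and so is treating the characteristics as independent (a diagonal Gaussian). The input is assumed to be a list of per-time-step probability distributions produced by some ASR model.

```python
import math
import statistics

EPS = 1e-12  # avoids log(0); an illustrative choice, not from the paper


def entropy(p):
    """Shannon entropy (in nats) of one probability distribution."""
    return -sum(x * math.log(x + EPS) for x in p)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions (in nats)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(u, v):
        return sum(a * math.log((a + EPS) / (b + EPS)) for a, b in zip(u, v))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def utterance_features(dists):
    """Map per-time-step distributions to one 5-dimensional feature vector.

    Assumption: each characteristic is averaged over the time steps.
    """
    per_step = [
        [statistics.median(p), max(p), min(p), entropy(p)] for p in dists
    ]
    feats = [statistics.fmean(col) for col in zip(*per_step)]
    # Divergence between the distributions of subsequent time steps.
    jsd = [js_divergence(p, q) for p, q in zip(dists, dists[1:])]
    feats.append(statistics.fmean(jsd) if jsd else 0.0)
    return feats


def fit_gaussian(benign_features):
    """Fit one Gaussian per characteristic on benign data (assumed independent)."""
    cols = list(zip(*benign_features))
    means = [statistics.fmean(c) for c in cols]
    # Floor the standard deviation so constant features do not divide by zero.
    stds = [max(statistics.pstdev(c), 1e-6) for c in cols]
    return means, stds


def log_likelihood(feats, model):
    """Log-likelihood of a feature vector under the fitted diagonal Gaussian."""
    means, stds = model
    return sum(
        -0.5 * math.log(2 * math.pi * s * s) - (x - m) ** 2 / (2 * s * s)
        for x, m, s in zip(feats, means, stds)
    )
```

In such a setup, an input would be flagged as adversarial when its log-likelihood falls below a threshold chosen on held-out benign data; sweeping that threshold is what yields a ROC curve and hence an AUROC value.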