Randomized smoothing (RS) certifies robustness in the vector space where Gaussian noise is added. In audio classification, this space is often not uniquely defined, because standard pipelines normalize, range-control, and transform waveforms into log-mel or other spectral features. We show that direct RS is therefore under-specified unless the certified object and preprocessing policy are made explicit. On two audio benchmarks, keyword spotting and environmental-sound classification, we study waveform, feature-space, and post-processed smoothing. Our diagnostics show why representation-aware reporting is necessary. First, at the same smoothing level $\sigma=0.0025$, the two datasets have the same median raw radius of $0.007996$, but different waveform energies yield different SNR-equivalent scales ($83.98$ vs. $90.97$ dB). Second, log-mel smoothing gives higher positive-radius certified accuracy on environmental sounds ($68.42\%$ vs. $65.53\%$): it certifies more examples with nonzero radius, but over features rather than waveforms. Third, clipping or peak normalization changes the effective perturbation norm by roughly $230$--$351\times$. We therefore recommend that audio RS studies choose and report the task-specific certified object and perturbation model, including the perturbation location, gain policy, raw radius, and any post-noise geometry changes.
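To make the distinction between a raw radius and an SNR-equivalent scale concrete, the following minimal sketch computes both. It rests on two assumptions that the abstract does not confirm: that the raw radius is the standard Gaussian RS $\ell_2$ radius $R = \sigma\,\Phi^{-1}(\underline{p_A})$, and that the SNR-equivalent scale is $20\log_{10}(\lVert x\rVert_2 / R)$. The function names, the example waveform, and the value of `p_lower` are hypothetical and chosen for illustration only.

```python
import math
from statistics import NormalDist


def certified_l2_radius(p_lower: float, sigma: float) -> float:
    """Standard Gaussian RS radius R = sigma * Phi^{-1}(p_lower).

    p_lower is a lower confidence bound on the top-class probability
    under noise. Returns 0.0 (abstain) when p_lower <= 0.5.
    ASSUMPTION: the source does not state which certificate it uses.
    """
    if not 0.0 <= p_lower < 1.0:
        raise ValueError("p_lower must lie in [0, 1)")
    if p_lower <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_lower)


def snr_equivalent_db(waveform, radius: float) -> float:
    """SNR-equivalent of an L2 radius: 20 * log10(||x||_2 / radius).

    ASSUMPTION: this definition is illustrative; the source may define
    its SNR-equivalent scale differently.
    """
    if radius <= 0.0:
        raise ValueError("radius must be positive")
    norm = math.sqrt(sum(s * s for s in waveform))
    return 20.0 * math.log10(norm / radius)


# Hypothetical example: the same raw radius maps to different dB scales
# when the clean waveforms carry different total energy.
radius = certified_l2_radius(p_lower=0.999, sigma=0.0025)
short_clip = [0.1] * 16000   # illustrative 1 s clip at 16 kHz
long_clip = [0.1] * 80000    # illustrative 5 s clip at 16 kHz
print(radius, snr_equivalent_db(short_clip, radius),
      snr_equivalent_db(long_clip, radius))
```

Under these assumptions the longer clip receives a larger SNR-equivalent value for the identical raw radius, which is the kind of mismatch that motivates reporting the raw radius together with the gain policy.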