Current synthetic speech detection (SSD) methods perform well on certain datasets but still suffer from limited robustness and interpretability. A possible reason is that these methods do not analyze the deficiencies of synthetic speech itself. This paper analyzes the flaws in speaker features that are inherent in the text-to-speech (TTS) process. Because TTS lacks fine-grained control over speaker features, differences arise in the temporal consistency of intra-utterance speaker features. Because speaker representations in TTS are based on speaker embeddings extracted by encoders, the distribution of inter-utterance speaker features also differs between synthetic and bonafide speech. Based on these analyses, an SSD method that exploits the temporal consistency and distribution of speaker features is proposed. On the one hand, modeling the temporal consistency of intra-utterance speaker features can aid speech anti-spoofing. On the other hand, distribution differences in inter-utterance speaker features can be utilized for SSD. The proposed method has low computational complexity and performs well in both cross-dataset and silence-trimming scenarios.
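The two cues named in the abstract can be illustrated with a minimal sketch. This is not the proposed method: the abstract does not specify how consistency or distribution differences are measured, so the function names `temporal_consistency` and `distribution_distance`, the use of cosine similarity between consecutive segment embeddings, and the per-dimension z-score distance are all assumptions made here for illustration. The segment-level and utterance-level speaker embeddings are assumed to come from some speaker encoder that is not shown.

```python
import numpy as np


def temporal_consistency(segment_embeddings) -> float:
    """Mean cosine similarity between consecutive segment-level embeddings.

    Hypothetical intra-utterance cue: `segment_embeddings` has shape (T, D),
    one speaker embedding per short segment of a single utterance.
    """
    x = np.asarray(segment_embeddings, dtype=float)
    if x.ndim != 2 or x.shape[0] < 2:
        raise ValueError("expected an array of shape (T, D) with T >= 2")
    # L2-normalize each segment embedding so dot products are cosines.
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x = x / np.clip(norms, 1e-12, None)
    # Cosine similarity of each segment with the next one.
    sims = np.sum(x[:-1] * x[1:], axis=1)
    return float(sims.mean())


def distribution_distance(utt_embedding, ref_embeddings) -> float:
    """Root-mean-square z-score of an utterance embedding against a reference set.

    Hypothetical inter-utterance cue: `ref_embeddings` has shape (N, D) and
    holds utterance-level embeddings of bonafide speech; larger values mean
    the utterance lies further from the reference distribution.
    """
    ref = np.asarray(ref_embeddings, dtype=float)
    mean = ref.mean(axis=0)
    std = ref.std(axis=0) + 1e-8  # avoid division by zero
    z = (np.asarray(utt_embedding, dtype=float) - mean) / std
    return float(np.sqrt(np.mean(z ** 2)))
```

In such a sketch, both scalars could be passed to a lightweight classifier. How the cues are actually modeled and combined is left to the full paper.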