Most recent speech privacy efforts have focused on anonymizing acoustic speaker attributes, but comparatively little research has addressed protecting the information carried by speech content. We introduce a toy problem that explores an emerging type of privacy called "content masking," which conceals selected words and phrases in speech. To help define this problem space, we evaluate an introductory baseline masking technique that modifies sequences of discrete phone representations (phone codes) produced by a pre-trained vector-quantized variational autoencoder (VQ-VAE); the modified sequences are then re-synthesized using WaveRNN. We investigate three masking locations and three masking strategies: noise substitution, word deletion, and phone sequence reversal. Our work attempts to characterize how masking affects two downstream tasks: automatic speech recognition (ASR) and automatic speaker verification (ASV). We observe how the different mask types and locations impact these downstream tasks and discuss how these effects may influence privacy goals.
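The three masking strategies named above can be illustrated on a toy phone-code sequence. This is a minimal sketch, not the authors' implementation: it assumes phone codes are integer indices into a VQ-VAE codebook and that the span of the target word (`start`, `end`) is already known. The function names, the `codebook_size` parameter, and the use of uniformly sampled codes as "noise" are all hypothetical choices made for illustration.

```python
import random
from typing import List, Sequence


def mask_noise(codes: Sequence[int], start: int, end: int,
               codebook_size: int, seed: int = 0) -> List[int]:
    """Replace codes[start:end] with random codes (assumed form of 'noise')."""
    rng = random.Random(seed)
    noise = [rng.randrange(codebook_size) for _ in range(end - start)]
    return list(codes[:start]) + noise + list(codes[end:])


def mask_delete(codes: Sequence[int], start: int, end: int) -> List[int]:
    """Remove codes[start:end], shortening the sequence."""
    return list(codes[:start]) + list(codes[end:])


def mask_reverse(codes: Sequence[int], start: int, end: int) -> List[int]:
    """Reverse the order of codes[start:end], keeping the length unchanged."""
    return list(codes[:start]) + list(codes[start:end])[::-1] + list(codes[end:])


# Toy sequence; indices 2..4 stand in for the word to be masked.
codes = [3, 7, 7, 1, 4, 9, 2]
deleted = mask_delete(codes, 2, 5)
reversed_span = mask_reverse(codes, 2, 5)
noised = mask_noise(codes, 2, 5, codebook_size=256)
```

In the pipeline the abstract describes, the masked code sequence would then be passed to the WaveRNN vocoder for re-synthesis; that step is omitted here. Note that deletion shortens the sequence while the other two strategies preserve its length.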