Large, curated datasets are required to leverage speech-based tools in healthcare. These datasets are costly to produce, which has increased interest in data sharing. Because speech can potentially identify speakers (i.e., it can serve as a voiceprint), sharing recordings raises privacy concerns. We examine the re-identification risk for speech recordings, without reference to demographics or other metadata, using a state-of-the-art speaker recognition system. We demonstrate that the risk is inversely related to the number of comparisons an adversary must consider, i.e., the search space. Risk is high for a small search space but drops as the search space grows (precision $> 0.85$ for $< 1 \times 10^{6}$ comparisons, precision $< 0.5$ for $> 3 \times 10^{6}$ comparisons). We also show that the nature of a speech recording influences re-identification risk, with non-connected speech (e.g., vowel prolongation) being harder to identify. Our findings suggest that speaker recognition systems can be used to re-identify participants in specific circumstances, but in practice, the re-identification risk appears low.
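The inverse relationship between search space and precision can be illustrated with a back-of-the-envelope model. The sketch below is not the method used in this work. It assumes a hypothetical speaker recognition system with fixed per-comparison true-acceptance and false-acceptance rates, and every number in it is an illustrative placeholder rather than a reported result. Under those assumptions, expected false matches grow with the number of comparisons while true matches are capped by the number of genuinely matching pairs, so precision falls as the search space grows.

```python
def expected_precision(n_comparisons, n_true_pairs, true_accept_rate, false_accept_rate):
    """Expected precision of accepted matches under a fixed-error-rate model.

    All parameters are hypothetical; this is an illustrative assumption,
    not the evaluation procedure of the study.
    """
    if n_true_pairs > n_comparisons:
        raise ValueError("n_true_pairs cannot exceed n_comparisons")
    # True matches are bounded by the number of genuinely matching pairs.
    expected_tp = n_true_pairs * true_accept_rate
    # False matches scale with the number of non-matching comparisons.
    expected_fp = (n_comparisons - n_true_pairs) * false_accept_rate
    total = expected_tp + expected_fp
    return expected_tp / total if total > 0 else 0.0


# Placeholder values chosen only to show the trend.
for n in (10**4, 10**5, 10**6, 10**7):
    p = expected_precision(n, n_true_pairs=100,
                           true_accept_rate=0.9, false_accept_rate=1e-5)
    print(f"{n:>10d} comparisons -> expected precision {p:.3f}")
```

The model ignores score calibration, recording type, and threshold selection, all of which would matter for a real estimate. It is meant only to show why a larger search space lowers precision.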