Automatic Speech Understanding (ASU) leverages deep learning models to interpret human speech accurately, enabling a wide range of speech applications that enrich the human experience. However, training a robust ASU model requires curating a large number of speech samples, which creates a risk of privacy breaches. In this work, we investigate using foundation models to assist privacy-enhancing speech computing. Unlike conventional work that focuses primarily on data perturbation or distributed algorithms, we study the possibility of using pre-trained generative models to synthesize speech content as training data with only label guidance. We show that zero-shot learning from synthetic speech content guided by training labels remains a challenging task. On the other hand, our results demonstrate that a model trained on synthetic speech samples provides an effective initialization point for low-resource ASU training. This result reveals the potential to enhance privacy by reducing user data collection and relying instead on label-guided synthetic speech content.