Distributional shift is a central challenge in deploying machine learning models, which can be ill-equipped for real-world data. This is particularly evident in text-to-audio generation, where the encoded representations are easily undermined by unseen prompts, degrading the generated audio. The limited set of text-audio pairs remains inadequate for conditional audio generation in the wild because user prompts are under-specified. In particular, we observe a consistent degradation in audio quality for samples generated from user prompts, as opposed to training set prompts. To this end, we present a retrieval-based in-context prompt editing framework that leverages the training captions as demonstrative exemplars to revise the user prompts. We show that the framework enhances audio quality across the set of collected user prompts, which were edited with reference to the training captions as exemplars.
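The retrieve-then-edit idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the retriever, the language model, or the instruction wording, so everything here is an assumption: a bag-of-words cosine similarity stands in for whatever retriever the framework uses, the function names (`retrieve_exemplars`, `build_edit_instruction`) are hypothetical, and the final step of sending the assembled instruction to a language model is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_exemplars(user_prompt, training_captions, k=3):
    """Return the k training captions most similar to the user prompt.

    Assumption: bag-of-words cosine similarity is a stand-in retriever,
    not necessarily the one used in the paper.
    """
    query = Counter(tokenize(user_prompt))
    scored = [
        (cosine_similarity(query, Counter(tokenize(caption))), caption)
        for caption in training_captions
    ]
    # Sort by score only; the stable sort keeps the original order among ties.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [caption for _, caption in scored[:k]]


def build_edit_instruction(user_prompt, exemplars):
    """Assemble an in-context instruction with exemplars as demonstrations.

    Assumption: the wording of this instruction is hypothetical.
    """
    lines = [
        "Rewrite the user prompt in the style of the example captions.",
        "",
        "Example captions:",
    ]
    lines += [f"- {caption}" for caption in exemplars]
    lines += ["", f"User prompt: {user_prompt}", "Edited prompt:"]
    return "\n".join(lines)
```

In use, the string returned by `build_edit_instruction` would be passed to a language model of the reader's choice, and the model's completion would replace the original user prompt as input to the text-to-audio model.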