Measuring audio prompt adherence with distribution-based embedding distances

An increasing number of generative music models can be conditioned on an audio prompt that serves as the musical context for which the model is to create an accompaniment (often further specified using a text prompt). How well model outputs adhere to the audio prompt is often evaluated in a model- or problem-specific manner, presumably because no generic evaluation method for audio prompt adherence has emerged. Such a method could be useful both in the development and training of new models and in making performance comparable across models. In this paper we investigate whether commonly used distribution-based distances, such as the Fréchet Audio Distance (FAD), can be used to measure audio prompt adherence. We propose a simple procedure based on a small number of constituents (an embedding model, a projection, an embedding distance, and a data fusion method), which we systematically assess using a baseline validation. In a follow-up experiment we test the sensitivity of the proposed audio prompt adherence measure to pitch and time shift perturbations. The results show that the proposed measure is sensitive to such perturbations, even when the reference and candidate distributions are from different music collections. More experimentation is needed to answer open questions, such as how robust the measure is to acoustic artifacts that do not affect audio prompt adherence. Nevertheless, the current results suggest that distribution-based embedding distances provide a viable way of measuring audio prompt adherence. A Python/PyTorch implementation of the proposed measure is publicly available as a GitHub repository.
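
To make the idea of a distribution-based embedding distance concrete, the following is a minimal sketch and not the implementation from the repository mentioned above. It assumes that context (audio prompt) and accompaniment embeddings are already available as arrays, uses synthetic random vectors in place of a real embedding model, omits the projection step, and takes plain concatenation as the data fusion method. The names `frechet_distance` and `fuse` are hypothetical. The distance is the Fréchet distance between Gaussians fitted to the two sets of fused embeddings.

```python
import numpy as np


def frechet_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two embedding sets.

    x, y: arrays of shape (n_samples, dim).
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.atleast_2d(np.cov(x, rowvar=False))
    cov_y = np.atleast_2d(np.cov(y, rowvar=False))
    # Tr((cov_x cov_y)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_x @ cov_y. For positive semi-definite inputs these
    # are real and non-negative up to numerical noise, hence the clipping.
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x) + np.trace(cov_y) - 2.0 * tr_sqrt)


def fuse(context_emb: np.ndarray, stem_emb: np.ndarray) -> np.ndarray:
    """Hypothetical data fusion: concatenate paired embeddings (assumption)."""
    return np.concatenate([context_emb, stem_emb], axis=1)


# Synthetic stand-ins for embeddings of audio prompts and accompaniments.
rng = np.random.default_rng(0)
context = rng.normal(size=(500, 4))
stem = context + 0.1 * rng.normal(size=(500, 4))  # correlated with its context

reference = fuse(context, stem)                       # matching pairs
mismatched = fuse(context, stem[rng.permutation(500)])  # pairing destroyed

d_self = frechet_distance(reference, reference)
d_mismatched = frechet_distance(reference, mismatched)
print(f"matching vs matching:   {d_self:.4f}")
print(f"matching vs mismatched: {d_mismatched:.4f}")
```

In this toy setup, shuffling the accompaniments leaves the marginal distributions of contexts and accompaniments unchanged, so any increase in the distance comes from the lost relation between the two, which is the property an adherence measure is meant to capture.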
