Interest in audio-language retrieval, which aims to establish the correspondence between the audio and text modalities, has been growing. However, the text in most audio-text paired datasets is far less expressive than the audio it describes. A significant challenge is that different audio samples often share similar or identical captions. This many-to-one mapping degrades retrieval performance. In this paper, we propose a novel approach to this data imbalance problem in the audio-language retrieval task. We introduce a distance sampling-based paraphraser built on ChatGPT that uses a distance function to generate manipulated text data with a controllable distribution. For a set of sentences with the same context, the distance quantifies the degree of manipulation between any two sentences, and ChatGPT is prompted in a few-shot manner with a text cluster whose sentence pairs have similar distances, where the distance is defined by the Jaccard similarity. As a result, ChatGPT can adjust the diversity of the manipulated text according to the distance. The proposed approach significantly improves audio-text retrieval performance and outperforms conventional text augmentation techniques.
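The distance-based grouping described in the abstract can be illustrated with a minimal sketch. It assumes whitespace tokenization, equal-width distance bins, and a plain-text prompt template. None of these are specified in the abstract, and the names `jaccard_distance`, `bucket_pairs_by_distance`, and `build_few_shot_prompt` are hypothetical. The sketch only assembles a prompt string and does not call any ChatGPT API.

```python
from itertools import combinations


def jaccard_distance(a: str, b: str) -> float:
    """Return 1 - |A & B| / |A | B| over lowercase whitespace tokens.

    Assumption: whitespace tokenization; the abstract does not state
    how sentences are tokenized.
    """
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0  # two empty sentences are treated as identical
    return 1.0 - len(tokens_a & tokens_b) / len(union)


def bucket_pairs_by_distance(captions, num_bins=4):
    """Group all caption pairs into equal-width distance bins over [0, 1].

    Assumption: equal-width bins stand in for the "text cluster with a
    similar distance" mentioned in the abstract.
    """
    bins = [[] for _ in range(num_bins)]
    for a, b in combinations(captions, 2):
        d = jaccard_distance(a, b)
        idx = min(int(d * num_bins), num_bins - 1)  # d == 1.0 goes to last bin
        bins[idx].append((a, b, d))
    return bins


def build_few_shot_prompt(pairs, target, k=3):
    """Assemble a few-shot paraphrasing prompt from up to k example pairs.

    Assumption: this template is illustrative, not the prompt used by
    the authors.
    """
    lines = ["Paraphrase the last sentence like the examples."]
    for source, paraphrase, _ in pairs[:k]:
        lines.append(f"Sentence: {source}")
        lines.append(f"Paraphrase: {paraphrase}")
    lines.append(f"Sentence: {target}")
    lines.append("Paraphrase:")
    return "\n".join(lines)


if __name__ == "__main__":
    captions = [
        "a dog barks",
        "a dog barks loudly",
        "rain falls on a roof",
    ]
    bins = bucket_pairs_by_distance(captions, num_bins=4)
    # Drawing examples from a low-distance bin should encourage
    # conservative paraphrases; a high-distance bin, more diverse ones.
    print(build_few_shot_prompt(bins[1], "a cat meows"))
```

Selecting the bin from which the few-shot examples are drawn is what gives control over how far the generated paraphrase departs from the original caption.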