Sound designers search for sounds in large sound effects libraries using aspects such as sound class or visual context. However, the metadata needed for such search is often missing or incomplete, and adding it requires significant manual effort. Existing solutions automate this task by generating metadata (i.e., captioning) or by searching with learned embeddings (i.e., text-audio retrieval), but they are not trained on metadata with the structure and information pertinent to sound design. To this end, we propose audiocards: structured metadata grounded in acoustic attributes and sonic descriptors, constructed by exploiting the world knowledge of LLMs. We show that training on audiocards improves downstream text-audio retrieval, descriptive captioning, and metadata generation on professional sound effects libraries. Moreover, audiocards also improve performance on general audio captioning and retrieval over the baseline single-sentence captioning approach. We release a curated dataset of sound effects audiocards to invite further research in audio language modeling for sound design.