Automated Audio Captioning (AAC) systems attempt to generate a natural-language sentence, a caption, that describes the content of an audio recording in terms of sound events. Existing datasets provide audio-caption pairs, with captions written in English only. In this work, we explore multilingual AAC using machine-translated captions. We automatically translated two prominent AAC datasets, AudioCaps and Clotho, from English to French, German and Spanish. We trained and evaluated monolingual systems in the four languages on AudioCaps and Clotho. In all cases, the models achieved similar performance, about 75% CIDEr on AudioCaps and 43% on Clotho. For French, we collected manually written captions for the AudioCaps eval subset. The French system, trained on the machine-translated version of AudioCaps, achieved significantly better results on the manual eval subset than the English system whose outputs we automatically translated to French. This argues in favor of building systems directly in the target language rather than simply translating the captions produced by an English system. Finally, we built a multilingual model that achieved results in each language comparable to those of the corresponding monolingual system, while using far fewer parameters than a collection of monolingual systems.