Multilingual Audio Captioning using machine translated data

Automated Audio Captioning (AAC) systems attempt to generate a natural language sentence, a caption, that describes the content of an audio recording in terms of sound events. Existing datasets provide audio-caption pairs, with captions written in English only. In this work, we explore multilingual AAC using machine-translated captions. We automatically translated two prominent AAC datasets, AudioCaps and Clotho, from English to French, German and Spanish. We trained and evaluated monolingual systems in the four languages on AudioCaps and Clotho. In all cases, the models achieved similar performance, about 75% CIDEr on AudioCaps and 43% on Clotho. For French, we acquired manual captions of the AudioCaps eval subset. The French system, trained on the machine-translated version of AudioCaps, achieved significantly better results on the manual eval subset than the English system whose outputs we automatically translated to French. This argues for building systems directly in a target language rather than simply translating the English system's captions into that language. Finally, we built a multilingual model, which achieved results in each language comparable to each monolingual system while using far fewer parameters than a collection of monolingual systems.
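As a rough illustration of the data preparation step described above, the following minimal sketch builds one translated copy of an audio-caption dataset per target language while leaving the audio identifiers untouched. Everything here is an assumption made for illustration: the names `translate_dataset` and `toy_translate`, the dictionary layout of a pair, and the placeholder translator are hypothetical and are not taken from the paper, which does not specify its tooling in this abstract.

```python
from typing import Callable, Dict, List

# Hypothetical layout of one audio-caption pair (an assumption, not the paper's format).
Pair = Dict[str, str]  # {"audio_id": ..., "caption": ...}


def translate_dataset(
    pairs: List[Pair],
    translate: Callable[[str, str], str],
    target_langs: List[str],
) -> Dict[str, List[Pair]]:
    """Return one translated copy of the dataset per target language.

    Only the caption text changes; each audio_id is kept as-is so the
    translated captions stay aligned with the original recordings.
    """
    translated: Dict[str, List[Pair]] = {}
    for lang in target_langs:
        translated[lang] = [
            {"audio_id": p["audio_id"], "caption": translate(p["caption"], lang)}
            for p in pairs
        ]
    return translated


def toy_translate(text: str, lang: str) -> str:
    # Placeholder standing in for a real machine translation system.
    # It only tags the caption with the language code.
    return f"[{lang}] {text}"


english_pairs = [
    {"audio_id": "clip_001", "caption": "a dog barks while cars pass by"},
    {"audio_id": "clip_002", "caption": "rain falls on a metal roof"},
]

# French, German and Spanish are the three target languages named in the abstract.
multilingual = translate_dataset(english_pairs, toy_translate, ["fr", "de", "es"])
```

In a real setup, `toy_translate` would be replaced by a call to an actual translation model, and each per-language list could then be used to train the corresponding monolingual system.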
