Automated audio captioning (AAC) is an important cross-modality translation task that aims to generate natural-language descriptions of audio clips. However, captions generated by previous AAC models often contain ``false-repetition'' errors, which stem from the training objective. To address this, we propose a new task, AAC error correction, which aims to reduce such errors by post-processing AAC outputs. Because paired erroneous and corrected captions are not readily available, we apply observation-based rules to corrupt error-free captions, generating pseudo grammatically erroneous sentences. Each pair of corrupted and clean sentences then serves as a training example. We train a neural network-based model on this synthetic error dataset and apply it to correct real errors in AAC outputs. Results on two benchmark datasets indicate that our approach significantly improves fluency while maintaining semantic information.
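The data-synthesis step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the actual corruption rules, so the single rule used here (duplicating a short span of words in place) is an assumption made for illustration, and the names `corrupt_caption`, `make_pairs`, and `max_span` are hypothetical rather than taken from the authors' code.

```python
import random


def corrupt_caption(caption: str, rng: random.Random, max_span: int = 3) -> str:
    """Duplicate a short word span in place to mimic a repetition error.

    This is an assumed rule for illustration, not the paper's rule set.
    """
    words = caption.split()
    if not words:
        return caption
    # Pick a span of 1..max_span words and a start position that fits.
    span = rng.randint(1, min(max_span, len(words)))
    start = rng.randint(0, len(words) - span)
    chunk = words[start:start + span]
    # Re-insert the chosen span directly after its first occurrence.
    return " ".join(words[:start + span] + chunk + words[start + span:])


def make_pairs(captions, seed=0):
    """Build (corrupted, clean) training pairs from error-free captions."""
    rng = random.Random(seed)
    return [(corrupt_caption(c, rng), c) for c in captions]


if __name__ == "__main__":
    for corrupted, clean in make_pairs(["a dog barks while birds chirp"]):
        print(corrupted, "->", clean)
```

Each resulting pair has the corrupted sentence as input and the clean caption as target, which is the form a sequence-to-sequence correction model would be trained on.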