The ability to jointly learn from multiple modalities, such as text, audio, and visual data, is a defining feature of intelligent systems. While there have been promising advances in designing neural networks to harness multimodal data, the enormous success of data augmentation currently remains limited to single-modality tasks like image classification. Indeed, it is particularly difficult to augment each modality while preserving the overall semantic structure of the data; for example, a caption may no longer be a good description of an image after standard augmentations, such as translation, have been applied. Moreover, it is challenging to specify reasonable transformations that are not tailored to a particular modality. In this paper, we introduce LeMDA, Learning Multimodal Data Augmentation, an easy-to-use method that automatically learns to jointly augment multimodal data in feature space, with no constraints on the identities of the modalities or the relationship between modalities. We show that LeMDA can (1) substantially improve the performance of multimodal deep learning architectures, (2) apply to combinations of modalities that have not been previously considered, and (3) achieve state-of-the-art results on a wide range of applications comprising image, text, and tabular data.
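As a rough illustration of what augmenting jointly in feature space can mean, the sketch below concatenates per-modality feature vectors, applies one shared perturbation computed from the joint vector, and splits the result back into modalities. It is not the LeMDA implementation: the function name `joint_feature_augment`, the tanh residual form, the `scale` parameter, and the randomly sampled weights are all assumptions made for illustration. In a learned scheme the weights would be trained rather than sampled.

```python
import math
import random


def joint_feature_augment(features, weights, scale=0.1):
    """Perturb all modality features together (hypothetical sketch).

    features: dict mapping modality name -> list of floats (one example).
    weights:  square matrix (list of rows) sized to the concatenated features.
    scale:    upper bound on the change applied to any single coordinate.
    """
    names = list(features)
    # Concatenate every modality into one joint feature vector.
    joint = [x for name in names for x in features[name]]

    augmented = []
    for row, value in zip(weights, joint):
        # Each coordinate's perturbation mixes ALL coordinates, so the change
        # to one modality depends on the content of the others.
        mixed = sum(w * x for w, x in zip(row, joint))
        augmented.append(value + scale * math.tanh(mixed))

    # Split the joint vector back into per-modality features.
    out, start = {}, 0
    for name in names:
        width = len(features[name])
        out[name] = augmented[start:start + width]
        start += width
    return out


rng = random.Random(0)
features = {
    "image": [rng.gauss(0, 1) for _ in range(4)],
    "text": [rng.gauss(0, 1) for _ in range(3)],
    "tabular": [rng.gauss(0, 1) for _ in range(2)],
}
dim = sum(len(v) for v in features.values())
# Placeholder weights; a learned method would optimize these instead.
weights = [[rng.gauss(0, 0.5) for _ in range(dim)] for _ in range(dim)]
augmented = joint_feature_augment(features, weights)
```

Because the perturbation is computed from the concatenated vector, nothing in the sketch depends on which modalities are present or how many there are. This mirrors the "no constraints on the identities of the modalities" property only in spirit.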