Multimodal Deep Learning enhances decision-making by integrating diverse information sources, such as text, images, audio, and video. To develop trustworthy multimodal approaches, it is essential to understand how uncertainty affects these models. We propose LUMA, a unique benchmark dataset featuring audio, image, and textual data from 50 classes, for learning from uncertain and multimodal data. It extends the well-known CIFAR 10/100 datasets with audio samples extracted from three audio corpora and with text data generated using the Gemma-7B Large Language Model (LLM). The LUMA dataset enables the controlled injection of varying types and degrees of uncertainty, so that experiments and benchmarking initiatives can be tailored to specific needs. LUMA is also available as a Python package that includes functions for generating multiple variants of the dataset, controlling the diversity of the data and the amount of noise in each modality, and adding out-of-distribution samples. A baseline pre-trained model is also provided alongside three uncertainty quantification methods: Monte-Carlo Dropout, Deep Ensemble, and Reliable Conflictive Multi-View Learning. This comprehensive dataset and its benchmarking tools are intended to promote and support the development, evaluation, and benchmarking of trustworthy and robust multimodal deep learning approaches. We anticipate that the LUMA dataset will help the ICLR community design more trustworthy and robust machine learning approaches for safety-critical applications.
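To illustrate the first of the three uncertainty quantification baselines mentioned above, Monte-Carlo Dropout, the following is a minimal NumPy sketch. It is not the LUMA package API (which is not shown here); the toy network, its random weights, and the function names are illustrative assumptions. The core idea is faithful to the technique: dropout is kept active at inference time, several stochastic forward passes are averaged, and the spread across passes serves as a predictive uncertainty estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer classifier weights (hypothetical stand-ins for a trained model).
W1 = rng.normal(size=(4, 16))
W2 = rng.normal(size=(16, 3))

def forward(x, drop_rate=0.5):
    """One stochastic forward pass with dropout kept active at inference."""
    h = np.maximum(x @ W1, 0.0)                 # ReLU hidden layer
    mask = rng.random(h.shape) > drop_rate      # Bernoulli dropout mask
    h = h * mask / (1.0 - drop_rate)            # inverted-dropout scaling
    logits = h @ W2
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)    # softmax class probabilities

def mc_dropout_predict(x, n_samples=100):
    """Average n_samples stochastic passes; the spread is the uncertainty."""
    probs = np.stack([forward(x) for _ in range(n_samples)])
    return probs.mean(axis=0), probs.std(axis=0)

x = rng.normal(size=(1, 4))                     # one 4-feature input sample
mean_probs, std_probs = mc_dropout_predict(x)   # prediction + uncertainty
```

In practice the same recipe applies to any dropout-equipped network: the mean over passes gives the prediction, while a large standard deviation flags inputs (e.g. noisy or out-of-distribution samples) where the model is unreliable.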