The rise of multimodal data, integrating text, audio, and visuals, has created new opportunities for studying multimodal tasks such as intent detection. This work investigates the effectiveness of Large Language Models (LLMs) and non-LLMs, including text-only and multimodal models, on the multimodal intent detection task. Our study reveals that Mistral-7B, a text-only LLM, outperforms most competitive multimodal models by approximately 9% on the MIntRec-1 dataset and 4% on the MIntRec2.0 dataset. This performance advantage stems from a strong textual bias in these datasets, where over 90% of the samples require textual input, either alone or in combination with other modalities, for correct classification. We further confirm this modality bias through human evaluation. Next, we propose a framework to debias the datasets; after debiasing, more than 70% of the samples in MIntRec-1 and more than 50% in MIntRec2.0 are removed, resulting in significant performance degradation across all models, with smaller multimodal fusion models affected most severely, their accuracy dropping by 50-60% or more. Further, we empirically analyze the context-specific relevance of different modalities. Our findings highlight the challenges posed by modality bias in multimodal intent datasets and emphasize the need for unbiased datasets to evaluate multimodal models effectively.