Impact of Multimodal and Conversational AI on Learning Outcomes and Experience

Multimodal Large Language Models (MLLMs) offer an opportunity to support multimedia learning through conversational systems grounded in educational content. However, while conversational AI is known to boost engagement, its impact on learning in visually rich STEM domains remains underexplored. Moreover, there is limited understanding of how multimodality and conversationality jointly influence learning in generative AI systems. This work reports findings from a randomized controlled online study (N = 124) comparing three approaches to learning biology from textbook content: (1) a document-grounded conversational AI with interleaved text-and-image responses (MuDoC), (2) a document-grounded conversational AI with text-only responses (TexDoC), and (3) a textbook interface with semantic search and highlighting (DocSearch). Learners using MuDoC achieved the highest post-test scores and reported the most positive learning experience. Notably, although learners rated TexDoC as significantly more engaging and easier to use than DocSearch, it led to the lowest post-test scores, revealing a disconnect between student perceptions and learning outcomes. Interpreted through the lens of Cognitive Load Theory, these findings suggest that conversationality reduces extraneous load, while the visual-verbal integration induced by multimodality increases germane load, leading to better learning outcomes. When conversationality is not complemented by multimodality, reduced cognitive effort may instead inflate perceived understanding without improving learning outcomes.
