Textbooks are one of the main media for delivering high-quality education to students. In particular, explanatory and illustrative visuals play a key role in retention, comprehension and general transfer of knowledge. However, many textbooks lack such visuals to support student learning. In this paper, we investigate how effectively vision-language models can automatically enhance textbooks with images from the web. We collect a dataset of e-textbooks in the math, science, social science and business domains. We then set up a text-image matching task that involves retrieving web images and appropriately assigning them to textbooks, which we frame as a matching optimization problem. Through a crowd-sourced evaluation, we find that (1) while the original textbook images are rated higher, automatically assigned ones are not far behind, and (2) the precise formulation of the optimization problem matters. We release the dataset of textbooks with an associated image bank to inspire further research at the intersection of computer vision and NLP for education.
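To make the framing of image assignment as a matching optimization problem more concrete, the following is a minimal sketch and not the formulation used in the paper, which the abstract does not specify. It assumes a hypothetical matrix `scores`, where `scores[i][j]` is a text-image similarity (for instance, one produced by a vision-language model) between textbook section `i` and candidate web image `j`. The function names `independent_top1` and `assign_images` are illustrative. The sketch contrasts picking the best image for each section independently with a one-to-one matching that maximizes total similarity while using each image at most once.

```python
from itertools import permutations

# Hypothetical similarity scores (assumed, not taken from the paper):
# rows are textbook sections, columns are candidate web images.
scores = [
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.30],
]


def independent_top1(similarity):
    """Pick the highest-scoring image for each section independently.

    The same image may be chosen for several sections.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in similarity]


def assign_images(similarity):
    """One-to-one matching that maximizes the total similarity.

    Brute-force search over all injective assignments, so it is only
    suitable for tiny illustrative inputs. Returns (assignment, total),
    where assignment[i] is the image index given to section i.
    """
    n_sections = len(similarity)
    n_images = len(similarity[0])
    if n_images < n_sections:
        raise ValueError("need at least as many images as sections")

    best_total = float("-inf")
    best_assignment = None
    for candidate in permutations(range(n_images), n_sections):
        total = sum(similarity[i][j] for i, j in enumerate(candidate))
        if total > best_total:
            best_total, best_assignment = total, candidate
    return list(best_assignment), best_total


if __name__ == "__main__":
    print("independent:", independent_top1(scores))
    print("matching:   ", assign_images(scores))
```

With these assumed scores, the independent choice gives image 0 to both sections, whereas the matching gives image 1 to the first section and image 0 to the second. This is one simple way in which the choice of formulation can change the result. For realistic problem sizes, a polynomial-time assignment solver such as `scipy.optimize.linear_sum_assignment` would replace the brute-force search.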