In education, the traditional Automatic Short Answer Grading (ASAG) with feedback problem has focused primarily on evaluating text-only responses. However, real-world assessments often include multimodal responses containing both diagrams and text. To address this limitation, we introduce the Multimodal Short Answer Grading with Feedback (MMSAF) problem, which requires jointly evaluating textual and diagrammatic content while also providing explanatory feedback. Collecting data representative of such multimodal responses is challenging due to both scale and logistical constraints. To mitigate this, we develop an automated data generation framework that leverages LLM hallucinations to mimic common student errors, thereby constructing a dataset of 2,197 instances. We evaluate 4 Multimodal Large Language Models (MLLMs) across 3 STEM subjects, showing that MLLMs achieve accuracies of up to 62.5% in predicting answer correctness (correct/partially correct/incorrect) and up to 80.36% in assessing image relevance. We also conduct a human evaluation with 9 annotators across 5 parameters, including a rubric-based approach. The rubrics further enable semantic evaluation of feedback quality, rather than relying on overlap-based metrics. Our findings highlight which MLLMs are better suited for such tasks while also pointing out the drawbacks of the remaining MLLMs.