Automated essay scoring (AES) predicts multiple rubric-defined trait scores for each essay, where each trait follows an ordered discrete rating scale. Most LLM-based AES methods cast scoring as autoregressive token generation and obtain the final score via decoding and parsing, making the decision implicit. This formulation is particularly sensitive in multimodal AES, where the usefulness of visual inputs varies across essays and traits. To address these limitations, we propose Decision-Level Ordinal Modeling (DLOM), which makes scoring an explicit ordinal decision by reusing the language model head to extract score-wise logits on predefined score tokens, enabling direct optimization and analysis in the score space. For multimodal AES, DLOM-GF introduces a gated fusion module that adaptively combines textual and multimodal score logits. For text-only AES, DLOM-DA adds a distance-aware regularization term to better reflect ordinal distances. Experiments on the multimodal EssayJudge dataset show that DLOM improves over a generation-based SFT baseline across scoring traits, and DLOM-GF yields further gains when modality relevance is heterogeneous. On the text-only ASAP/ASAP++ benchmarks, DLOM remains effective without visual inputs, and DLOM-DA further improves performance and outperforms strong representative baselines.
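The three ideas named in the abstract (score-wise logits read from the language model head, gated fusion of textual and multimodal logits, and a distance-aware regularizer) can be illustrated with a minimal NumPy sketch. The abstract gives no formulas, so everything here is an assumption made for illustration: the function names, the scalar sigmoid gate, the expected-absolute-distance penalty, and the weight `lam` are hypothetical and may differ from the actual DLOM, DLOM-GF, and DLOM-DA definitions.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def score_logits(vocab_logits, score_token_ids):
    """Keep only the LM-head logits at the predefined score tokens.

    vocab_logits: array of shape (..., vocab_size) from the LM head.
    score_token_ids: token ids for the ordered scores, lowest score first
    (hypothetical ids; a real tokenizer would supply them).
    """
    return np.asarray(vocab_logits)[..., list(score_token_ids)]


def gated_fusion(text_logits, mm_logits, gate_logit):
    """Assumed fusion: a sigmoid gate mixes the two sets of score logits."""
    g = 1.0 / (1.0 + np.exp(-gate_logit))  # gate value in (0, 1)
    return g * mm_logits + (1.0 - g) * text_logits


def distance_aware_loss(logits, target, lam=0.5):
    """Cross-entropy plus an assumed ordinal penalty for one essay trait.

    The penalty is the expected absolute distance between the predicted
    score index and the target index, so probability mass placed far from
    the target costs more than mass placed on a neighboring score.
    """
    p = softmax(logits)
    k = np.arange(logits.shape[-1])
    ce = -np.log(p[target])
    dist = np.sum(p * np.abs(k - target))
    return ce + lam * dist


def predict(logits):
    """Explicit decision in score space: index of the largest score logit."""
    return int(np.argmax(logits, axis=-1))
```

In this sketch the decision is made directly over the score indices, with no decoding or parsing of generated text. With `lam=0` the loss reduces to plain cross-entropy, which makes the effect of the ordinal term easy to isolate.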