Automated scoring of ESG narrative disclosures with large language models (LLMs) is gaining traction, yet whether reasoning-heavy frontier models add value commensurate with their cost remains empirically unsettled. We evaluate this question on a corpus of ten Japanese listed firms across three rubric axes (quantitative targets, progress-tracking infrastructure, and external-standard alignment) using a four-model consensus design that combines a reasoning-on frontier model with three reasoning-off contemporaries. Across 120 firm × axis × model scores, the pooled mean absolute deviation between the reasoning-on model and each reasoning-off counterpart is 0.38 on a 5-point scale; only 2% of pairwise comparisons reach a two-point deviation, and none exceeds two points. Per-firm cost accounting shows that the reasoning-on arm alone costs roughly 5.6× as much as the three-provider reasoning-off ensemble, for scores that differ only marginally. We conclude that in span-based ESG narrative scoring, reasoning-heavy deployment does not materially improve outcomes relative to reasoning-off consensus, while substantially increasing operational cost. We discuss implications for cost-effective ESG auto-scoring pipelines and for LLM deployment governance in applied accountability settings. An earlier version of this work is available on SSRN (Abstract ID 6683303).
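The pooled mean absolute deviation reported above can be illustrated with a minimal sketch. It assumes that each pairwise comparison matches the reasoning-on score with one reasoning-off score for the same firm and axis, which for ten firms, three axes, and three counterparts would give 90 comparisons. The model labels, axis keys, function names, and toy scores below are hypothetical placeholders, not the study's data or its actual implementation.

```python
def pooled_mad(scores, reference, others):
    """Pooled mean absolute deviation between `reference` and each model in `others`.

    `scores` maps (firm, axis) -> {model_name: score on the 5-point scale}.
    Returns the pooled mean and the list of per-comparison absolute deviations.
    """
    deviations = [
        abs(by_model[reference] - by_model[other])
        for by_model in scores.values()
        for other in others
    ]
    return sum(deviations) / len(deviations), deviations


def share_at_least(deviations, threshold):
    """Fraction of pairwise comparisons whose deviation is >= threshold."""
    return sum(d >= threshold for d in deviations) / len(deviations)


# Toy input (hypothetical): two firms x two axes x four models.
toy_scores = {
    ("firm_a", "quantitative_targets"): {"reasoning_on": 4, "off_1": 4, "off_2": 3, "off_3": 4},
    ("firm_a", "progress_tracking"): {"reasoning_on": 3, "off_1": 3, "off_2": 3, "off_3": 5},
    ("firm_b", "quantitative_targets"): {"reasoning_on": 2, "off_1": 2, "off_2": 3, "off_3": 2},
    ("firm_b", "progress_tracking"): {"reasoning_on": 5, "off_1": 4, "off_2": 5, "off_3": 5},
}

mad, deviations = pooled_mad(toy_scores, "reasoning_on", ["off_1", "off_2", "off_3"])
two_point_share = share_at_least(deviations, 2)
print(f"pooled MAD = {mad:.3f} over {len(deviations)} comparisons")
print(f"share with deviation >= 2 points = {two_point_share:.1%}")
```

Pooling across all firm, axis, and counterpart combinations gives one headline number, while keeping the per-comparison list lets the tail (two-point or larger disagreements) be reported separately, mirroring the two statistics quoted in the abstract.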