Large language models are transforming learning, cognition, and research across many fields. Effectively integrating them into professional domains such as accounting is a key challenge for enterprise digital transformation. To address this, we define vertical-domain accounting reasoning and propose evaluation criteria derived from an analysis of the training-data characteristics of representative GLM models. These criteria support systematic study of accounting reasoning and provide benchmarks for performance improvement. Using this framework, we evaluate GLM-6B, GLM-130B, GLM-4, and OpenAI GPT-4 on accounting reasoning tasks. The results show that prompt design significantly affects performance, with GPT-4 demonstrating the strongest capability. Despite these gains, current models remain insufficient for real-world enterprise accounting, indicating the need for further optimization to unlock their full practical value.