Large Language Models (LLMs) are reshaping learning paradigms, cognitive processes, and research methodologies across a wide range of domains. Integrating LLMs with professional fields, and redefining the relationship between LLMs and domain-specific applications, has become a critical challenge for enterprise digital transformation and broader social development. Effective integration of LLMs into the accounting domain requires an understanding of their domain-specific reasoning capabilities. This study introduces the concept of vertical-domain accounting reasoning and, by analyzing the training-data characteristics of representative GLM-series models, establishes evaluation criteria for it. These criteria lay a foundation for subsequent research on reasoning paradigms and serve as benchmarks for improving accounting reasoning performance. Based on this framework, we evaluate several representative models, including GLM-6B, GLM-130B, GLM-4, and OpenAI GPT-4, on a set of accounting reasoning tasks. Experimental results show that different prompt-engineering strategies yield varying degrees of performance improvement across models, with GPT-4 exhibiting the strongest accounting reasoning capability. Nevertheless, current LLMs still fall short of real-world application requirements; in particular, further optimization is needed before deployment in enterprise-level accounting scenarios can fully realize the potential value of LLMs in this domain.
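To make the comparison of prompt-engineering strategies concrete, the sketch below contrasts two common strategies of the kind evaluated in the paper: a plain zero-shot prompt and a chain-of-thought prompt for an accounting reasoning question. This is a minimal illustration, not the paper's actual prompts; the function names, wording, and the example question are all hypothetical.

```python
# Hypothetical sketch of two prompt-engineering strategies for an
# accounting reasoning task: zero-shot vs. chain-of-thought (CoT).
# The prompt wording and example question are assumptions for illustration,
# not the prompts used in the study.

def zero_shot_prompt(question: str) -> str:
    """Zero-shot strategy: ask for the answer directly, no reasoning cue."""
    return (
        "Answer the following accounting question.\n\n"
        f"Question: {question}\nAnswer:"
    )

def chain_of_thought_prompt(question: str) -> str:
    """Chain-of-thought strategy: instruct the model to reason stepwise,
    e.g. identify the accounts involved, then the debit/credit entries,
    before stating the final answer."""
    return (
        "Answer the following accounting question. First identify the "
        "accounts involved, then determine the debit and credit entries "
        "step by step, and finally state the answer.\n\n"
        f"Question: {question}\nLet's think step by step:"
    )

if __name__ == "__main__":
    # Hypothetical bookkeeping question used only to show the two prompts.
    question = (
        "A company purchases equipment for 50,000 in cash. "
        "Which accounts are debited and which are credited?"
    )
    for build in (zero_shot_prompt, chain_of_thought_prompt):
        print(f"--- {build.__name__} ---")
        print(build(question))
```

In an evaluation setting, each constructed prompt would be sent to the model under test (GLM-6B, GLM-130B, GLM-4, or GPT-4) and the answers scored against a reference; the abstract's finding is that the choice of strategy shifts accuracy by different amounts per model.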