Large language models are now used daily for writing, search, and analysis, and their natural language understanding continues to improve. However, they remain unreliable at exact numerical calculation and at producing outputs that are straightforward to audit. We study a synthetic payroll system as a focused, high-stakes example and evaluate whether models can understand a payroll schema, apply rules in the correct order, and deliver cent-accurate results. Our experiments span a tiered dataset ranging from basic to complex cases, a spectrum of prompts from minimal baselines to schema-guided and reasoning variants, and multiple model families, including GPT, Claude, Perplexity, Grok, and Gemini. The results indicate clear regimes where careful prompting suffices and regimes where explicit computation is required. This work offers a compact, reproducible framework and practical guidance for deploying LLMs in settings that demand both accuracy and assurance.