Recent advances in artificial intelligence have sparked interest in industrial agents that support analysts in tabular-data workflows within regulated sectors such as finance and healthcare. A key capability for such systems is performing accurate arithmetic operations on structured data while ensuring sensitive information never leaves secure, on-premises environments. Here, we introduce an error-driven optimization framework for arithmetic reasoning that enhances a Code Generation Agent (CGA), applied specifically to on-premises small language models (SLMs). Through a systematic evaluation of a leading SLM (Qwen3 4B), we find that while the base model exhibits fundamental limitations on arithmetic tasks, our proposed error-driven method, which clusters erroneous predictions to iteratively refine prompt rules, dramatically improves performance, raising the model's accuracy to 70.8\%. Our results suggest that reliable, interpretable, and industrially deployable AI assistants can be built not only through costly fine-tuning but also via systematic, error-driven prompt optimization, enabling small models to surpass larger language models (e.g., GPT-3.5 Turbo) in a privacy-compliant manner.
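To make the error-driven refinement loop concrete, the following minimal Python sketch illustrates one way erroneous predictions could be clustered and mapped to corrective prompt rules that are appended iteratively. All names (`run_cga`, `cluster_errors`, `rule_for_cluster`) and the clustering heuristic are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of an error-driven prompt-rule refinement loop.
# The callables run_cga and is_correct are placeholders supplied by the caller;
# the clustering keys and rules below are hand-written examples, not the
# method reported in the paper.

from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Example = Dict[str, str]          # e.g. {"question": ..., "answer": ...}
Prediction = Tuple[Example, str]  # (example, model output)


def cluster_errors(errors: List[Prediction]) -> Dict[str, List[Prediction]]:
    """Group wrong predictions by a coarse error signature (placeholder heuristic)."""
    clusters: Dict[str, List[Prediction]] = defaultdict(list)
    for example, output in errors:
        if not output.strip():
            key = "empty_output"
        elif "%" in example["question"] or "percent" in example["question"]:
            key = "percentage_handling"
        else:
            key = "other_arithmetic"
        clusters[key].append((example, output))
    return clusters


def rule_for_cluster(key: str) -> str:
    """Map an error cluster to a corrective prompt rule (hand-written here)."""
    rules = {
        "empty_output": "Always emit runnable code that prints the final numeric answer.",
        "percentage_handling": "Convert percentages to decimals before computing.",
        "other_arithmetic": "Compute step by step with explicit intermediate variables.",
    }
    return rules[key]


def refine_prompt(base_prompt: str,
                  dataset: List[Example],
                  run_cga: Callable[[str, Example], str],
                  is_correct: Callable[[Example, str], bool],
                  max_iters: int = 3) -> str:
    """Iteratively append rules derived from clustered errors to the prompt."""
    prompt = base_prompt
    for _ in range(max_iters):
        predictions = [(ex, run_cga(prompt, ex)) for ex in dataset]
        errors = [(ex, out) for ex, out in predictions if not is_correct(ex, out)]
        if not errors:
            break
        for key in cluster_errors(errors):
            rule = rule_for_cluster(key)
            if rule not in prompt:
                prompt += f"\n- Rule: {rule}"
    return prompt
```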