As large language model (LLM) assistants become increasingly integrated into enterprise workflows, their ability to generate accurate, semantically aligned, and executable outputs is critical. However, current conversational business analytics (CBA) systems often lack built-in verification mechanisms, leaving users to manually validate potentially flawed results. This paper introduces two complementary verification techniques: Q*, which performs reverse translation and semantic matching between code and user intent, and Feedback+, which incorporates execution feedback to guide code refinement. Embedded within a generator-discriminator framework, these mechanisms shift validation responsibility from users to the system. Evaluations on three benchmark datasets (Spider, Bird, and GSM8K) demonstrate that both Q* and Feedback+ reduce error rates and task completion time. The study also identifies reverse translation as a key bottleneck, highlighting opportunities for future improvement. Overall, this work contributes a design-oriented framework for building more reliable, enterprise-grade GenAI systems capable of trustworthy decision support.
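To make the described pipeline concrete, the following is a minimal sketch of a generator-discriminator loop combining a Q*-style reverse-translation check with a Feedback+-style execution check. The LLM hooks (llm_generate_sql, llm_reverse_translate), the token-overlap matcher, and the sqlite3-backed execution step are illustrative assumptions introduced here for exposition, not the paper's implementation.

```python
# Hedged sketch: generator-discriminator verification loop.
# All LLM calls are replaced with canned stand-ins; semantic matching uses a
# simple token-overlap heuristic and execution feedback comes from sqlite3.
import sqlite3


def llm_generate_sql(question: str, feedback: str = "") -> str:
    """Hypothetical generator: would call an LLM; returns a canned query here."""
    return "SELECT name FROM employees WHERE salary > 50000"


def llm_reverse_translate(sql: str) -> str:
    """Hypothetical Q*-style step: would ask an LLM to restate the SQL in plain language."""
    return "list the name of employees whose salary is greater than 50000"


def semantic_match(question: str, reverse_nl: str, threshold: float = 0.3) -> bool:
    """Placeholder semantic matcher: token overlap instead of an LLM judge."""
    q_tokens = set(question.lower().split())
    r_tokens = set(reverse_nl.lower().split())
    overlap = len(q_tokens & r_tokens) / max(len(q_tokens), 1)
    return overlap >= threshold


def execute_with_feedback(sql: str, conn: sqlite3.Connection):
    """Feedback+-style step: run the code and return results or an error message."""
    try:
        return conn.execute(sql).fetchall(), ""
    except sqlite3.Error as exc:
        return None, f"execution error: {exc}"


def verify_and_refine(question: str, conn: sqlite3.Connection, max_rounds: int = 3):
    """Regenerate until both discriminator checks pass or the round budget runs out."""
    feedback = ""
    for _ in range(max_rounds):
        sql = llm_generate_sql(question, feedback)
        # Q*-style discriminator: does the code still mean what the user asked?
        if not semantic_match(question, llm_reverse_translate(sql)):
            feedback = "reverse translation does not match the question"
            continue
        # Feedback+-style discriminator: does the code actually run?
        rows, error = execute_with_feedback(sql, conn)
        if error:
            feedback = error
            continue
        return sql, rows
    return None, None


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (name TEXT, salary REAL)")
    conn.executemany("INSERT INTO employees VALUES (?, ?)",
                     [("Ada", 90000), ("Bo", 42000)])
    print(verify_and_refine("Which employees have a salary greater than 50000?", conn))
```

In this sketch the system, rather than the user, decides whether a generated query is acceptable: semantic mismatch or execution failure is turned into feedback for the next generation round, mirroring the paper's shift of validation responsibility from users to the system.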