As Large Language Models (LLMs) continue to evolve, practitioners face a growing range of options for enhancing inference-time performance without model retraining, including budget tuning and multi-step techniques such as self-reflection. While these methods improve output quality, they create complex trade-offs among accuracy, cost, and latency that remain poorly understood across different domains. This paper systematically compares self-reflection and budget tuning on mathematical reasoning and translation tasks. We evaluate prominent LLMs, including the Anthropic Claude, Amazon Nova, and Mistral families, along with other models, under varying reflection depths and compute budgets to derive Pareto-optimal performance frontiers. Our analysis reveals substantial domain-dependent variation in self-reflection effectiveness, with performance gains of up to 220\% in mathematical reasoning. We further investigate how reflection-round depth and feedback-mechanism quality influence performance across model families. To validate our findings in a real-world setting, we deploy a self-reflection-enhanced marketing content localisation system at Lounge by Zalando, where it shows market-dependent effectiveness, reinforcing the importance of domain-specific evaluation when deploying these techniques. Our results provide actionable guidance for selecting optimal inference strategies given specific domains and resource constraints. We open-source our self-reflection implementation for reproducibility at https://github.com/aws-samples/sample-genai-reflection-for-bedrock.
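For intuition, the multi-round self-reflection technique the abstract refers to can be sketched as a generate–critique–revise loop. This is a minimal illustration of the general pattern only, not the paper's actual implementation (which is in the linked repository); `call_model` is a hypothetical placeholder for an LLM invocation, e.g. via Amazon Bedrock.

```python
def call_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; a real system would
    invoke a model (e.g. Claude, Nova, or Mistral via Bedrock) here."""
    return f"response to: {prompt[:60]}"


def self_reflect(task: str, rounds: int = 2) -> str:
    """Answer a task, then iteratively critique and revise the answer.

    `rounds` corresponds to the reflection depth varied in the paper's
    experiments; each round costs additional tokens and latency.
    """
    answer = call_model(task)
    for _ in range(rounds):
        # Feedback step: ask the model to critique its own answer.
        critique = call_model(
            f"Critique this answer to the task '{task}':\n{answer}"
        )
        # Revision step: regenerate the answer conditioned on the critique.
        answer = call_model(
            f"Task: {task}\nPrevious answer: {answer}\n"
            f"Feedback: {critique}\nProduce a revised answer."
        )
    return answer
```

Each extra round trades additional cost and latency for potential quality gains, which is precisely the accuracy/cost/latency frontier the paper evaluates.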