As Large Language Models (LLMs) continue to evolve, practitioners face a growing range of options for enhancing inference-time performance without model retraining, including budget tuning and multi-step techniques such as self-reflection. While these methods improve output quality, they create complex trade-offs among accuracy, cost, and latency that remain poorly understood across different domains. This paper systematically compares self-reflection and budget tuning on mathematical reasoning and translation tasks. We evaluate prominent LLMs, including the Anthropic Claude, Amazon Nova, and Mistral families, along with other models, under varying reflection depths and compute budgets to derive Pareto-optimal performance frontiers. Our analysis reveals substantial domain-dependent variation in self-reflection effectiveness, with performance gains of up to 220\% in mathematical reasoning. We further investigate how reflection-round depth and feedback-mechanism quality influence performance across model families. To validate our findings in a real-world setting, we deploy a self-reflection-enhanced marketing content localisation system at Lounge by Zalando, where it shows market-dependent effectiveness, reinforcing the importance of domain-specific evaluation when deploying these techniques. Our results provide actionable guidance for selecting optimal inference strategies given specific domains and resource constraints. We open-source our self-reflection implementation for reproducibility at https://github.com/aws-samples/sample-genai-reflection-for-bedrock.
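For intuition, the multi-round self-reflection technique the abstract refers to can be sketched as a generate–critique–revise loop. This is a minimal illustration of the general pattern only, not the paper's actual implementation (which is in the linked repository); `call_model` is a hypothetical placeholder for an LLM invocation, e.g. via Amazon Bedrock.

```python
def call_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; a real system would
    invoke a model (e.g. Claude, Nova, or Mistral via Bedrock) here."""
    return f"response to: {prompt[:60]}"


def self_reflect(task: str, rounds: int = 2) -> str:
    """Answer a task, then iteratively critique and revise the answer.

    `rounds` corresponds to the reflection depth varied in the paper's
    experiments; each round costs additional tokens and latency.
    """
    answer = call_model(task)
    for _ in range(rounds):
        # Feedback step: ask the model to critique its own answer.
        critique = call_model(
            f"Critique this answer to the task '{task}':\n{answer}"
        )
        # Revision step: regenerate the answer conditioned on the critique.
        answer = call_model(
            f"Task: {task}\nPrevious answer: {answer}\n"
            f"Feedback: {critique}\nProduce a revised answer."
        )
    return answer
```

Each extra round trades additional cost and latency for potential quality gains, which is precisely the accuracy/cost/latency frontier the paper evaluates.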