Inference-time computation is a critical yet challenging paradigm for enhancing the reasoning performance of large language models (LLMs). While existing strategies improve reasoning stability and consistency, they suffer from notable limitations: self-correction often reinforces the model's initial biases, and Multi-Agent Collaboration (MAC) often fails for lack of efficient coordination mechanisms, leading to collective errors. Although high-performing verifiers can detect reasoning errors, making them reliable requires substantial training. To address these challenges, we introduce Adaptive Coopetition (AdCo), a novel inference-time framework in which LLM agents employ an adaptive, UCB-based "coopetition" mechanism. At each round, agents leverage coarse verifier signals to decide whether to collaborate or compete, and iteratively refine their reasoning based on peer feedback. Without relying on high-performing verifiers, our adaptive strategy achieves significant gains on mathematical reasoning benchmarks, including a 20% relative improvement over baselines on the more challenging dataset. Accuracy remains robust and consistent across different sample sizes and configurations. This adaptive, signal-guided "coopetition" framework enhances reasoning robustness by leveraging both model knowledge diversity and reasoning trace measures, while also promoting uncertainty-driven exploration, especially when participants have comparable capabilities. From this perspective, our work offers a fresh lens on inference-time computation and paves the way for more resilient multi-agent LLM systems. Our code is available at: https://github.com/AdCo-Research/adaptive-coopetition.
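To make the collaborate-or-compete decision concrete, the sketch below casts it as a two-armed bandit solved with the standard UCB1 rule, where a coarse verifier score in [0, 1] serves as the per-round reward. This is a minimal illustration under our own assumptions: the class name, the reward model, and the simulated verifier are hypothetical and do not reproduce AdCo's actual implementation.

```python
import math
import random

class CoopetitionSelector:
    """Hypothetical UCB1-based selector: each agent treats
    'collaborate' and 'compete' as bandit arms and picks a mode
    per round from coarse verifier rewards (illustrative only)."""

    ARMS = ("collaborate", "compete")

    def __init__(self, c: float = math.sqrt(2)):
        self.c = c  # exploration weight in the UCB1 bonus term
        self.counts = {arm: 0 for arm in self.ARMS}
        self.rewards = {arm: 0.0 for arm in self.ARMS}

    def choose(self) -> str:
        # Play each arm once before applying the UCB1 formula.
        for arm in self.ARMS:
            if self.counts[arm] == 0:
                return arm
        total = sum(self.counts.values())

        def ucb(arm: str) -> float:
            mean = self.rewards[arm] / self.counts[arm]
            bonus = self.c * math.sqrt(math.log(total) / self.counts[arm])
            return mean + bonus

        return max(self.ARMS, key=ucb)

    def update(self, arm: str, verifier_score: float) -> None:
        # Coarse verifier signal in [0, 1] acts as the bandit reward.
        self.counts[arm] += 1
        self.rewards[arm] += verifier_score

# One agent over a few rounds, with a random stand-in for the verifier.
agent = CoopetitionSelector()
random.seed(0)
for _ in range(10):
    mode = agent.choose()
    score = random.random()  # placeholder for a coarse verifier signal
    agent.update(mode, score)
print(sum(agent.counts.values()))  # 10 rounds played
```

Under this framing, the UCB bonus term drives the uncertainty-driven exploration mentioned above: a mode that has been tried less often gets a larger bonus, so agents with comparable capabilities keep probing both collaboration and competition rather than locking into one.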