Majority voting over multiple LLM attempts improves mathematical reasoning, but correlated errors limit the effective sample size. A natural fix is to assign different reasoning strategies to different voters. This approach, Diverse Prompt Mixer, is tested on the AIMO 3 competition: 3 models, 23+ experiments, 50 IMO-level problems, one H100 80 GB GPU, and a 5-hour limit. Every prompt-level intervention fails. High-temperature sampling already decorrelates errors, and weaker strategies reduce accuracy more than they reduce correlation. Across an 8-point capability gap between models at equal N=8, and across every optimization tested, model capability dominates. The gap between the best majority-vote score (42/50) and pass@20 (~45.5) is selection loss, not prompt loss. A verifier-based selector could close it. Prompt engineering cannot.
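To make the distinction between selection loss and prompt loss concrete, here is a minimal sketch of how the two scores compared above might be computed from sampled answers. It assumes that selection loss means the number of problems where the correct answer was sampled at least once but lost the vote. The function names and the toy data are hypothetical, not taken from the experiments, and the oracle count is a simple any-correct tally rather than a pass@k estimator.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common non-None answer, or None if no answer was produced."""
    counts = Counter(a for a in answers if a is not None)
    return counts.most_common(1)[0][0] if counts else None


def score(attempts, truth):
    """Compare majority voting with an oracle selector.

    attempts: one list of sampled answers per problem.
    truth:    the correct answer for each problem.
    Returns (voted, oracle, selection_loss).
    """
    # Problems solved when the most frequent answer is submitted.
    voted = sum(majority_vote(a) == t for a, t in zip(attempts, truth))
    # Problems where at least one sample is correct (upper bound for any selector).
    oracle = sum(t in a for a, t in zip(attempts, truth))
    return voted, oracle, oracle - voted


# Toy data invented for illustration only.
attempts = [
    [7, 7, 3, 7],  # majority is correct
    [5, 5, 9, 5],  # correct answer 9 was sampled but outvoted
    [1, 2, 2, 2],  # correct answer was never sampled
]
truth = [7, 9, 4]

voted, oracle, loss = score(attempts, truth)
print(voted, oracle, loss)  # prints: 1 2 1
```

In this toy case the second problem is the selection loss: a better selector, such as a verifier, could recover it, whereas the third problem can only be recovered by sampling a correct answer in the first place.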