Design and Evaluation of Multi-Agent AI Oracle Systems for Prediction Market Resolution

Prediction markets aggregate collective intelligence to forecast uncertain events, but their utility depends on reliable outcome resolution. Existing oracle systems trade off fast but brittle automation against accurate but costly human arbitration. Single-LLM oracles achieve meaningful accuracy but inherit every failure mode of their underlying model, with no self-correction mechanism. We evaluate whether multi-agent LLM architectures can improve oracle resolution accuracy over single-model baselines. We compare independent aggregation and deliberative consensus against single-LLM baselines (GPT-5 Nano, DeepSeek V3, and Llama-3.3-70B) on 1,189 resolved prediction market questions from KalshiBench. All agents share a common evidence layer through Exa, with retrieval filtered by publication date, so that differences in accuracy reflect reasoning rather than retrieval quality. Independent aggregation with confidence-weighted voting achieves the highest accuracy at 83.43 percent, outperforming the best individual model by 1.01 percentage points. Deliberative consensus degrades accuracy to approximately 76 percent, below every single-model baseline, which we attribute to error propagation during debate, where confidently wrong models flip correct ones. Error correlations across models (0.529-0.689) explain why aggregation gains fall short of the theoretical Condorcet ceiling, placing a fundamental limit on ensemble approaches. Many questions resist correction by any multi-agent architecture, motivating escalation to human arbitration. We propose routing criteria for hybrid AI-human oracle systems: auto-resolving only unanimous, high-confidence questions yields 97.87 percent accuracy on 47 percent of the dataset, with inter-agent disagreement flagging the remainder for human review.
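
As a rough illustration of the two mechanisms named in the abstract, the sketch below shows one way confidence-weighted voting and the unanimous, high-confidence routing rule could be written. It is not the authors' implementation. The `AgentVote` type, the function names, the "yes"/"no" answer labels, the [0, 1] confidence scale, and the 0.9 threshold are all assumptions made for the example, and the confidence values are invented.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class AgentVote:
    """One agent's proposed resolution (hypothetical structure)."""
    agent: str         # model name, e.g. "DeepSeek V3"
    answer: str        # proposed resolution, e.g. "yes" or "no"
    confidence: float  # self-reported confidence, assumed to lie in [0, 1]


def weighted_vote(votes: List[AgentVote]) -> str:
    """Return the answer whose supporters have the largest summed confidence."""
    if not votes:
        raise ValueError("at least one vote is required")
    totals: Dict[str, float] = {}
    for vote in votes:
        totals[vote.answer] = totals.get(vote.answer, 0.0) + vote.confidence
    # On an exact tie, max() returns the answer that was seen first.
    return max(totals, key=lambda answer: totals[answer])


def route(votes: List[AgentVote], min_confidence: float = 0.9) -> Optional[str]:
    """Auto-resolve only unanimous, high-confidence questions.

    Returns the agreed answer, or None to flag the question for human review.
    The 0.9 threshold is an assumed value, not one taken from the paper.
    """
    answers = {vote.answer for vote in votes}
    unanimous = len(answers) == 1
    confident = all(vote.confidence >= min_confidence for vote in votes)
    if unanimous and confident:
        return votes[0].answer
    return None


# Illustrative votes only; the confidences are made up.
split = [
    AgentVote("GPT-5 Nano", "yes", 0.55),
    AgentVote("DeepSeek V3", "no", 0.95),
    AgentVote("Llama-3.3-70B", "yes", 0.30),
]
unanimous = [
    AgentVote("GPT-5 Nano", "yes", 0.95),
    AgentVote("DeepSeek V3", "yes", 0.92),
    AgentVote("Llama-3.3-70B", "yes", 0.97),
]

print(weighted_vote(split))  # "no": 0.95 outweighs 0.55 + 0.30
print(route(split))          # None: the agents disagree, so escalate
print(route(unanimous))      # "yes": unanimous and every confidence >= 0.9
```

In the split case a simple majority would pick "yes", while the confidence-weighted sum picks "no". That same weighting is only as reliable as the agents' calibration, which is one reason the routing rule sends any disagreement to human review instead of trusting the aggregate.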
