When Hallucination Costs Millions: Benchmarking AI Agents in High-Stakes Adversarial Financial Markets

We present CAIA, a benchmark exposing a critical blind spot in AI evaluation: the inability of state-of-the-art models to operate in adversarial, high-stakes environments where misinformation is weaponized and errors are irreversible. While existing benchmarks measure task completion in controlled settings, real-world deployment demands resilience against active deception. Using crypto markets as a testbed, where $30 billion was lost to exploits in 2024, we evaluate 17 models on 178 time-anchored tasks that require agents to distinguish truth from manipulation, navigate fragmented information landscapes, and make irreversible financial decisions under adversarial pressure. Our results reveal a fundamental capability gap: without tools, even frontier models achieve only 28% accuracy on tasks that junior analysts routinely handle. Tool augmentation improves performance, but accuracy plateaus at 67.4% versus an 80% human baseline, despite unlimited access to professional resources. Most critically, we uncover a systematic tool selection catastrophe: models preferentially choose unreliable web search over authoritative data sources, falling for SEO-optimized misinformation and social media manipulation. This behavior persists even when correct answers are directly accessible through specialized tools, suggesting foundational limitations rather than knowledge gaps. We also find that Pass@k metrics mask trial-and-error behavior that would be dangerous in autonomous deployment, where a wrong attempt cannot be retried. The implications extend beyond crypto to any domain with active adversaries, such as cybersecurity and content moderation. We release CAIA with contamination controls and continuous updates, establishing adversarial robustness as a necessary condition for trustworthy AI autonomy. The benchmark reveals that current models, despite impressive reasoning scores, remain fundamentally unprepared for environments where intelligence must survive active opposition.
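To illustrate why a Pass@k score can hide trial-and-error behavior, the sketch below compares the widely used unbiased Pass@k estimator (at least one of k sampled attempts is correct) with a stricter "all k attempts correct" rate. This is a minimal sketch under stated assumptions: the abstract does not give the exact Pass@k formulation used for CAIA, the function names `pass_at_k` and `all_k_correct` are hypothetical, and the counts in the usage lines are made up for illustration and are not results from the benchmark.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k attempts, drawn without
    replacement from n recorded attempts of which c are correct,
    is correct (the standard unbiased Pass@k estimator)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect attempts exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def all_k_correct(n: int, c: int, k: int) -> float:
    """Probability that all k drawn attempts are correct, a stricter
    view that penalizes any wrong attempt (hypothetical helper)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


# Illustrative numbers only: 10 attempts on one task, 3 of them correct.
single_try = pass_at_k(10, 3, 1)      # chance a single attempt is right
five_tries = pass_at_k(10, 3, 5)      # chance any of 5 attempts is right
five_all = all_k_correct(10, 3, 5)    # chance all 5 attempts are right
print(f"pass@1={single_try:.3f} pass@5={five_tries:.3f} all-5={five_all:.3f}")
```

With these made-up counts, a model that is right on fewer than a third of single attempts still scores above 0.9 on Pass@5, while the all-correct rate is zero. When each attempt is an irreversible action, the single-attempt and all-correct views are arguably the more relevant ones.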
