We present CAIA, a benchmark exposing a critical blind spot in AI evaluation: the inability of state-of-the-art models to operate in adversarial, high-stakes environments where misinformation is weaponized and errors are irreversible. While existing benchmarks measure task completion in controlled settings, real-world deployment demands resilience against active deception. Using crypto markets, where $30 billion was lost to exploits in 2024, as a testbed, we evaluate 17 models on 178 time-anchored tasks that require agents to distinguish truth from manipulation, navigate fragmented information landscapes, and make irreversible financial decisions under adversarial pressure. Our results reveal a fundamental capability gap: without tools, even frontier models achieve only 28% accuracy on tasks that junior analysts routinely handle. Tool augmentation improves performance but plateaus at 67.4%, versus an 80% human baseline, despite unlimited access to professional resources. Most critically, we uncover a systematic tool selection catastrophe: models preferentially choose unreliable web search over authoritative data sources, falling for SEO-optimized misinformation and social media manipulation. This behavior persists even when correct answers are directly accessible through specialized tools, suggesting foundational limitations rather than knowledge gaps. We also find that Pass@k metrics mask trial-and-error behavior that is dangerous in autonomous deployment. The implications extend beyond crypto to any domain with active adversaries, such as cybersecurity and content moderation. We release CAIA with contamination controls and continuous updates, establishing adversarial robustness as a necessary condition for trustworthy AI autonomy. The benchmark shows that current models, despite impressive reasoning scores, remain fundamentally unprepared for environments where intelligence must survive active opposition.
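For context on the Pass@k claim, we assume the standard unbiased estimator of Chen et al. (2021); the abstract does not restate the paper's exact formulation. Under that definition, a task counts as solved if any of k sampled attempts succeeds, which is precisely why the metric can reward trial-and-error:

\[
\mathrm{Pass@}k \;=\; \mathbb{E}_{\text{tasks}}\!\left[\, 1 - \binom{n-c}{k} \bigg/ \binom{n}{k} \right],
\]

where \(n\) is the number of attempts sampled per task, \(c\) is the number of those attempts that succeed, and \(k \le n\).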