Potential data contamination in contemporary large language model (LLM) benchmarks presents a fundamental challenge to establishing trustworthy evaluation frameworks. Moreover, existing benchmarks predominantly assume benign, resource-rich settings, leaving the behavior of LLMs under pressure unexplored. In this paper, we introduce \textsc{Squid Game}, a dynamic, adversarial evaluation environment with resource-constrained and asymmetric-information settings, designed to evaluate LLMs through interactive gameplay against other LLM opponents. \textsc{Squid Game} consists of six elimination-style levels that target multi-faceted abilities, including instruction following, coding, reasoning, planning, and safety alignment. We evaluate over 50 LLMs on \textsc{Squid Game}, presenting the largest behavioral evaluation study of general-purpose LLMs in dynamic adversarial scenarios to date. We observe a clear generational phase transition in performance within the same model lineage, and we find evidence that some models resort to speculative shortcuts to win, suggesting the possibility of higher-level evaluation-paradigm contamination in static benchmarks. We also compare prominent LLM benchmarks with \textsc{Squid Game}, highlighting that dynamic evaluation can serve as a complement to static evaluation. Project page: https://github.com/zijianchen98/LLM_Squid_Game.