Formal models are essential for specifying large, complex computer systems and verifying their correctness, but they are notoriously expensive to write and maintain. Recent advances in generative AI show promise for generating certain forms of specifications. However, existing work mostly targets small code snippets, not complete systems. It is unclear whether AI can deal with realistic system artifacts, as doing so requires abstracting their complex behavioral properties into formal models. We present SysMoBench, a benchmark that evaluates AI's ability to formally model large, complex systems. We focus on concurrent and distributed systems, which are keystones of today's critical computing infrastructure, encompassing operating systems and cloud services. We use TLA+, the de facto specification language for concurrent and distributed systems, though the benchmark can be extended to other specification languages. We address the primary challenge of evaluating AI-generated models by automating metrics such as syntactic and runtime correctness, conformance to system code, and invariant correctness. SysMoBench currently includes eleven diverse system artifacts, including the Raft implementations of Etcd and Redis, the leader election of ZooKeeper, and the spinlock, mutex, and ring buffer of Asterinas OS, with more being added. SysMoBench enables us to understand the capabilities and limitations of today's LLMs and agents, putting tools in this area on a firm footing and opening up promising new research directions.