SpatialBench: Can Agents Analyze Real-World Spatial Biology Data?

Spatial transcriptomics assays are rapidly increasing in scale and complexity, making computational analysis a major bottleneck in biological discovery. Although frontier AI agents have improved dramatically at software engineering and general data analysis, it remains unclear whether they can extract biological insight from messy, real-world spatial datasets. We introduce SpatialBench, a benchmark of 146 verifiable problems derived from practical spatial analysis workflows spanning five spatial technologies and seven task categories. Each problem provides a snapshot of experimental data taken immediately before an analysis step, together with a deterministic grader that checks whether a key biological result was recovered. Results on frontier models show that base model accuracy remains low (20-38% across model families), with strong model-task and model-platform interactions. Harness design has a large empirical effect on performance, indicating that tools, prompts, control flow, and execution environment should be evaluated and improved as first-class objects. SpatialBench serves both as a measurement tool and as a diagnostic lens for developing agents that can interact with real spatial datasets faithfully, transparently, and reproducibly.
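To make the idea of a deterministic grader concrete, the following is a minimal sketch of what such a check might look like. It is not SpatialBench's actual implementation: the class names `NumericGrader` and `SetGrader`, the relative tolerance, and the Jaccard threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class NumericGrader:
    """Hypothetical grader: passes when a numeric answer is within a relative tolerance."""
    expected: float
    rel_tol: float = 0.05  # assumed tolerance, not taken from the paper

    def grade(self, answer: float) -> bool:
        return abs(answer - self.expected) <= self.rel_tol * abs(self.expected)


@dataclass(frozen=True)
class SetGrader:
    """Hypothetical grader: passes when a submitted gene set overlaps the reference enough."""
    expected: frozenset
    min_jaccard: float = 0.5  # assumed threshold, not taken from the paper

    def grade(self, answer: Iterable[str]) -> bool:
        # Normalize case so "cd3e" and "CD3E" count as the same gene symbol.
        got = frozenset(g.upper() for g in answer)
        want = frozenset(g.upper() for g in self.expected)
        union = got | want
        if not union:
            return True  # both sets empty: trivially identical
        return len(got & want) / len(union) >= self.min_jaccard


# Example usage with made-up values.
cell_count = NumericGrader(expected=100.0)
markers = SetGrader(expected=frozenset({"CD3E", "CD8A", "GZMB"}))
print(cell_count.grade(103.0))
print(markers.grade(["cd3e", "Cd8a", "MKI67"]))
```

Because both graders depend only on the submitted answer and a fixed reference, the same answer always receives the same verdict, which is the property the abstract calls deterministic.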