We introduce iTRIALSPACE, a programmable evaluation framework for controlled assessment of lung CT models. Standard benchmarks are static retrospective collections that entangle lesion size, lobe prevalence, anatomy, and acquisition context, making it difficult to determine what structurally drives model accuracy. iTRIALSPACE addresses this limitation by composing real clinical CTs and lesion profiles into controlled virtual lesion trials through a four-stage pipeline: multi-dataset nodule profiling, explicit trial specification, anatomy-aware mask insertion, and ControlNet-conditioned CT synthesis. The framework is built on a unified 54-attribute nodule-profile dataset spanning 13,140 annotated nodules from seven public CT sources and is instantiated as 13 trial modes. We evaluate iTRIALSPACE in a 55,469-sample Virtual Lesion Study spanning three medical VLMs, four spatial-guidance conditions, and three clinical tasks. Across all 13 modes, the synthetic substrate remains within the real-to-real FID baseline, and synthetic performance rankings transfer strongly to real clinical data ($\rho = 0.93$, $p < 10^{-15}$). Controlled trial modes expose findings unavailable to fixed-distribution benchmarks, including shortcut-driven size prediction collapse under lobe-equalized sampling and host-to-donor variance ratios of 8.9x and 3.3x in twin-cross analysis. These results position iTRIALSPACE as an auditable evaluation infrastructure for controlled, falsifiable testing beyond static retrospective benchmarks.
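The synthetic-to-real ranking transfer reported above can be illustrated with a minimal sketch. It assumes that $\rho$ denotes Spearman's rank correlation, which the abstract does not state explicitly. The accuracy values are made up for illustration and are not results from the study, and the function names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-configuration accuracies (illustrative values only).
synthetic_accuracy = [0.62, 0.71, 0.55, 0.80, 0.67]
real_accuracy = [0.58, 0.69, 0.60, 0.77, 0.61]

rho = spearman_rho(synthetic_accuracy, real_accuracy)
print(f"Spearman rho = {rho:.2f}")
```

A value of $\rho$ close to 1 means that configurations ranked highly on synthetic data are also ranked highly on real data, which is the sense in which rankings "transfer".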