Virtual photography asks an agent to enter a prepared 3D scene with no preselected camera pose or reference image, infer a suitable shot from scene information and a language intent, choose executable camera parameters, and render the final photograph. Recent progress in vision-language models makes this kind of spatial agent increasingly plausible, but the task stresses two capabilities that remain hard to evaluate together: complex 3D spatial understanding and abstract aesthetic judgment. We introduce PhotoFlow, a Director-Reviewer-Reflector agent for closed-loop camera search. The Director builds a soft photographic blueprint and proposes diverse candidate cameras; the Reviewer combines rule checks, visual critique, and pairwise incumbent selection; and the Reflector converts failures into region memory, dead-zone suppression, and high-explore relocation. We also introduce VPhotoBench, a benchmark of 47 open-license Blender scenes and 141 language-conditioned photography missions spanning subject placement, relational composition, and atmosphere/style. In held-out experiments under a six-round rendering budget, PhotoFlow achieves the highest external quality-alignment composite score and success rate, ahead of one-shot prediction, single-chain reflection, anchor-bank selection, and random search. To our knowledge, this is the first work to make language-conditioned virtual photography in arbitrary Blender scenes an executable agent task, and our results show that an LLM-centered spatial agent can already produce strong photographs in a setting designed to challenge both 3D reasoning and aesthetic choice.
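To make the closed-loop structure concrete, the following is a minimal toy sketch of a Director-Reviewer-Reflector search under a six-round budget. It is not PhotoFlow's implementation: the `Camera` and `Memory` types, the `TARGET` position, the floor-height rule in `passes_rules`, and the distance-based `toy_score` are all hypothetical stand-ins for the blueprint, rule checks, and visual critique described above, and no rendering or language model is involved.

```python
import math
import random
from dataclasses import dataclass, field

# Hypothetical "ideal" camera position used only by the toy score below.
TARGET = (2.0, -1.0, 1.6)


@dataclass(frozen=True)
class Camera:
    x: float
    y: float
    z: float

    def pos(self):
        return (self.x, self.y, self.z)


@dataclass
class Memory:
    """Toy region memory: centers of regions where candidates failed."""
    dead_zones: list = field(default_factory=list)
    radius: float = 0.5

    def is_dead(self, cam):
        return any(math.dist(cam.pos(), c) < self.radius for c in self.dead_zones)


def passes_rules(cam):
    # Stand-in rule check (assumption): the camera must sit above the floor.
    return cam.z >= 0.2


def toy_score(cam):
    # Stand-in for visual critique (assumption): closer to TARGET is better.
    return -math.dist(cam.pos(), TARGET)


def director(rng, incumbent, memory, n=4, explore=False):
    """Propose up to n candidates, skipping suppressed dead zones."""
    wide = explore or incumbent is None
    base = Camera(0.0, 0.0, 1.5) if wide else incumbent
    spread = 4.0 if wide else 0.5
    candidates = []
    for _ in range(100 * n):  # bounded retries so dead zones cannot stall the loop
        if len(candidates) == n:
            break
        cam = Camera(*(v + rng.uniform(-spread, spread) for v in base.pos()))
        if not memory.is_dead(cam):
            candidates.append(cam)
    return candidates


def search(seed=0, rounds=6):
    rng = random.Random(seed)
    memory = Memory()
    incumbent, best = None, -math.inf
    explore = False
    history = []
    for _ in range(rounds):
        improved = False
        for cam in director(rng, incumbent, memory, explore=explore):
            if not passes_rules(cam):
                # Reflector: remember the failure and suppress this region.
                memory.dead_zones.append(cam.pos())
                continue
            score = toy_score(cam)
            if score > best:
                # Reviewer: pairwise comparison against the current incumbent.
                incumbent, best, improved = cam, score, True
        # Reflector: switch to wide exploration after a round with no improvement.
        explore = not improved
        history.append(best)
    return incumbent, best, history
```

Calling `search(seed=0)` returns the final incumbent camera, its toy score, and the best score after each of the six rounds. Because the incumbent is only replaced when a candidate wins the pairwise comparison, the per-round best score never decreases in this sketch.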