Recent advances in generative AI have enabled natural-language-driven image editing, yet existing systems often fail on complex scenes with multiple interacting objects because they depend heavily on users crafting precise text prompts. To address this lack of structured control, we propose SceneCraft, an interactive framework that bridges user intent and model execution by representing an image as an editable scene graph. Instead of arriving at a workable text prompt through trial and error, users manipulate the visual graph directly to perform complex spatial and relational edits. These graph modifications are automatically translated into precise, context-aware editing prompts, effectively eliminating linguistic ambiguity. To ensure robust and diverse results, the structured prompts are dispatched to multiple state-of-the-art generative models. Evaluations across diverse editing scenarios show that SceneCraft provides a more intuitive control mechanism, significantly reducing the cognitive burden of manual prompt engineering while generating outputs that users consistently rate higher in quality and fidelity.
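To make the graph-to-prompt idea concrete, the following is a minimal sketch, not SceneCraft's actual implementation. It assumes a scene graph can be reduced to a set of object names plus (subject, predicate, target) relations, and that an edit is the difference between the graph before and after the user's changes. All names here (`SceneGraph`, `Relation`, `edits_to_prompt`, `dispatch`) and the prompt wording are hypothetical, and the backends are stand-in callables rather than real model APIs.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Relation:
    """A directed edge in the scene graph, e.g. (cat, on, sofa). Hypothetical type."""
    subject: str
    predicate: str
    target: str


@dataclass
class SceneGraph:
    """Assumed minimal scene graph: object names plus relations between them."""
    objects: set[str] = field(default_factory=set)
    relations: set[Relation] = field(default_factory=set)


def _key(r: Relation) -> tuple[str, str, str]:
    # Sort key so the generated prompt is deterministic.
    return (r.subject, r.predicate, r.target)


def edits_to_prompt(before: SceneGraph, after: SceneGraph) -> str:
    """Turn the difference between two graphs into one editing prompt."""
    clauses = [f"remove the {o}" for o in sorted(before.objects - after.objects)]
    clauses += [f"add a {o}" for o in sorted(after.objects - before.objects)]

    # A subject that lost a relation and gained a new one is treated as "moved".
    moved = {r.subject for r in before.relations - after.relations}
    for r in sorted(after.relations - before.relations, key=_key):
        verb = "move" if r.subject in moved else "place"
        clauses.append(f"{verb} the {r.subject} so it is {r.predicate} the {r.target}")

    if not clauses:
        return "Leave the image unchanged."

    prompt = "; ".join(clauses)
    prompt = prompt[0].upper() + prompt[1:] + "."

    # Untouched relations are stated as context so the model preserves them.
    kept = sorted(before.relations & after.relations, key=_key)
    if kept:
        context = "; ".join(
            f"the {r.subject} stays {r.predicate} the {r.target}" for r in kept
        )
        prompt += f" Keep everything else unchanged: {context}."
    return prompt


def dispatch(prompt: str, backends: dict[str, Callable[[str], str]]) -> dict[str, str]:
    """Send the same structured prompt to every backend (stand-ins for models)."""
    return {name: run(prompt) for name, run in backends.items()}


# Example: the user adds a rug node and re-links the cat from the sofa to the rug.
before = SceneGraph(
    objects={"cat", "sofa", "lamp"},
    relations={Relation("cat", "on", "sofa"), Relation("lamp", "next to", "sofa")},
)
after = SceneGraph(
    objects={"cat", "sofa", "lamp", "rug"},
    relations={Relation("cat", "on", "rug"), Relation("lamp", "next to", "sofa")},
)
prompt = edits_to_prompt(before, after)
print(prompt)
```

Deriving the prompt from a graph difference, rather than from free-form text, means every clause corresponds to an explicit user action, and unchanged relations can be listed as context for the model to preserve.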