Text-to-image generation has advanced rapidly, yet aligning complex textual prompts with generated visuals remains challenging, especially with intricate object relationships and fine-grained details. This paper introduces Fast Prompt Alignment (FPA), a prompt optimization framework that leverages a one-pass approach, enhancing text-to-image alignment efficiency without the iterative overhead typical of current methods like OPT2I. FPA uses large language models (LLMs) for single-iteration prompt paraphrasing, followed by fine-tuning or in-context learning with optimized prompts to enable real-time inference, reducing computational demands while preserving alignment fidelity. Extensive evaluations on the COCO Captions and PartiPrompts datasets demonstrate that FPA achieves competitive text-image alignment scores at a fraction of the processing time, as validated through both automated metrics (TIFA, VQA) and human evaluation. A human study with expert annotators further reveals a strong correlation between human alignment judgments and automated scores, underscoring the robustness of FPA's improvements. The proposed method showcases a scalable, efficient alternative to iterative prompt optimization, enabling broader applicability in real-time, high-demand settings. The codebase is provided to facilitate further research: https://github.com/tiktok/fast_prompt_alignment
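As a rough illustration of the one-pass idea, the sketch below paraphrases a prompt once, keeps the best-scoring candidate, and reuses such (original, optimized) pairs as in-context examples. It is a minimal sketch under stated assumptions, not the released implementation: the function names, the selection-by-score step, and the few-shot template are all hypothetical, and `toy_paraphrase` and `toy_score` are placeholders standing in for an LLM call and an alignment metric such as TIFA or VQA.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not from the paper's codebase):
#   Paraphraser(prompt, n) -> n candidate rewrites produced in one LLM call
#   Scorer(original, candidate) -> alignment score, higher is better
Paraphraser = Callable[[str, int], List[str]]
Scorer = Callable[[str, str], float]


def one_pass_optimize(prompt: str, paraphrase: Paraphraser, score: Scorer,
                      n_candidates: int = 4) -> str:
    """Paraphrase once, score every candidate, and return the best one.

    There is no feedback loop: the paraphraser is called exactly once.
    The original prompt is kept as a candidate so the result is never
    worse than the input under the given scorer.
    """
    candidates = [prompt] + paraphrase(prompt, n_candidates)
    return max(candidates, key=lambda c: score(prompt, c))


def build_few_shot_prompt(pairs: List[Tuple[str, str]], new_prompt: str) -> str:
    """Format (original, optimized) pairs as in-context examples for an LLM."""
    blocks = ["Rewrite each prompt so a text-to-image model follows it more faithfully."]
    for original, optimized in pairs:
        blocks.append(f"Original: {original}\nOptimized: {optimized}")
    blocks.append(f"Original: {new_prompt}\nOptimized:")
    return "\n\n".join(blocks)


def toy_paraphrase(prompt: str, n: int) -> List[str]:
    """Placeholder for an LLM paraphrasing call."""
    return [f"{prompt} (paraphrase {i})" for i in range(1, n + 1)]


def toy_score(original: str, candidate: str) -> float:
    """Placeholder for an alignment metric; here it simply favors longer text."""
    return float(len(candidate))


if __name__ == "__main__":
    best = one_pass_optimize("a cat on a red sofa", toy_paraphrase, toy_score, 3)
    print(build_few_shot_prompt([("a cat on a red sofa", best)], "two dogs under a tree"))
```

At inference time only the few-shot prompt (or a model fine-tuned on such pairs) would be needed, so each new prompt costs a single LLM call rather than a generate-and-score loop.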