Recent large-scale generative modeling has attained unprecedented performance, especially in producing high-fidelity images from text prompts. Textual inversion (TI), used alongside text-to-image model backbones, has been proposed as an effective technique for personalizing generation when prompts contain user-defined, unseen, or long-tail concept tokens. Despite this, we find and show that deploying TI still involves a good deal of "dark magic": to name a few issues, a harsh requirement for additional datasets, arduous human effort in the loop, and a lack of robustness. In this work, we propose a much-enhanced version of TI, dubbed Controllable Textual Inversion (COTI), that resolves all of the aforementioned problems and in turn delivers a robust, data-efficient, and easy-to-use framework. The core of COTI is a theoretically guided loss objective instantiated with a comprehensive and novel weighted scoring mechanism, encapsulated in an active-learning paradigm. Extensive results show that COTI significantly outperforms prior TI-related approaches, with a 26.05 decrease in FID score and a 23.00% boost in R-precision.