Recent diffusion-based, subject-driven generative methods can generate images of specific objects or human portraits with good fidelity. However, to make these methods more versatile in applications, we argue that improved datasets and evaluations are needed, along with more careful methods that retrieve only the relevant information from conditional images. To this end, we propose RetriBooru-V1, a dataset of anime figures with enhanced identity and clothing labels. We state the new tasks enabled by this dataset and introduce a new diversity metric that measures success in completing them by quantifying the flexibility of image generation. We establish a RAG-inspired baseline method designed to retrieve precise conditional information from reference images. We then compare it with current methods on existing tasks to demonstrate its capability. Finally, we provide baseline experimental results on the new tasks and conduct ablation studies on possible structural choices.