This study focuses on a novel task in text-to-image (T2I) generation: action customization. The objective is to learn the action shared by a limited set of exemplar images and generalize it to unseen humans or even animals. Experimental results show that existing subject-driven customization methods fail to learn the representative characteristics of actions and struggle to decouple actions from context features, including appearance. To overcome the preference for low-level features and the entanglement of high-level features, we propose an inversion-based method, Action-Disentangled Identifier (ADI), which learns action-specific identifiers from the exemplar images. ADI first expands the semantic conditioning space by introducing layer-wise identifier tokens, which increases representational richness while distributing the inversion across different features. Then, to block the inversion of action-agnostic features, ADI extracts gradient invariance from constructed sample triples and masks the updates of irrelevant channels. To evaluate the task comprehensively, we present ActionBench, a benchmark that covers a variety of actions, each accompanied by carefully selected samples. Both quantitative and qualitative results show that ADI outperforms existing baselines in action-customized T2I generation. Our project page is at https://adi-t2i.github.io/ADI.
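The gradient-masking idea can be illustrated with a minimal sketch. It is not the ADI implementation: the function names `invariance_mask` and `masked_update`, the `keep_ratio` parameter, and the choice to keep the channels whose gradients agree most closely between two samples of the same action are all assumptions made for illustration. The sketch also uses only one pair from a triple and treats an identifier token as a flat list of channel values.

```python
def invariance_mask(grad_anchor, grad_same_action, keep_ratio=0.5):
    """Build a 0/1 channel mask from two per-channel gradient lists.

    Assumption (hypothetical reading of "gradient invariance"): channels
    whose gradients differ least between two samples showing the same
    action are treated as action-related and kept; the rest are masked.
    """
    diffs = [abs(a - b) for a, b in zip(grad_anchor, grad_same_action)]
    n_keep = max(1, int(keep_ratio * len(diffs)))
    # Sort channel indices by gradient difference, smallest first.
    order = sorted(range(len(diffs)), key=lambda i: diffs[i])
    keep = set(order[:n_keep])
    return [1.0 if i in keep else 0.0 for i in range(len(diffs))]


def masked_update(token, grad, mask, lr=0.1):
    """One gradient-descent step that skips the masked channels."""
    return [t - lr * m * g for t, g, m in zip(token, grad, mask)]


# Toy gradients for a 4-channel identifier token (made-up numbers).
grad_anchor = [0.9, 0.1, -0.5, 0.3]
grad_same_action = [0.8, -0.6, -0.45, 1.0]

mask = invariance_mask(grad_anchor, grad_same_action, keep_ratio=0.5)
token = masked_update([0.0, 0.0, 0.0, 0.0], grad_anchor, mask, lr=0.1)
```

In this toy case only the two channels with the most consistent gradients are updated, and the other two keep their initial values. With layer-wise identifier tokens, the same masking step would be applied to each layer's token separately.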