Test-Time Self-Adaptive Conditioning for Stable Audio-Driven Talking-Head Generation

Audio-driven talking-head generation has progressed rapidly with recent models such as AniTalker, FLOAT, and Sonic. Despite this success, most existing approaches rely on a single static reference image to condition the entire video at inference time. This static conditioning paradigm often creates a mismatch between fixed identity features and dynamically evolving facial motion, leading to identity drift, temporal inconsistency, and degraded perceptual quality. We introduce Test-Time Self-Adaptive Conditioning (TT-SAC), a parameter-free inference framework that lets pretrained talking-head generators adapt their conditioning representations during inference without retraining, gradient updates, or additional supervision. Instead of treating the reference portrait as immutable, TT-SAC composes the generator with its encoder in a feedback loop: the generator's own outputs are re-encoded to construct a refined conditioning representation that better aligns with the temporal dynamics of the synthesized sequence. A single adaptation step approximates a self-consistent equilibrium of the generative process, stabilizing identity and motion over time. We further provide a theoretical analysis showing that, under mild Lipschitz assumptions, test-time conditioning adaptation reduces feature variance and improves generative stability, and that it exhibits a bias-variance tradeoff governing the optimal adaptation strength. Extensive experiments on state-of-the-art talking-head generators and benchmark datasets demonstrate consistent improvements in lip-sync accuracy, temporal coherence, identity preservation, and perceptual fidelity. TT-SAC offers a model-agnostic, training-free strategy for enhancing generative video models, establishing test-time conditioning adaptation as an effective mechanism for stabilizing audio-driven portrait animation.
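The feedback loop described above can be illustrated with a minimal sketch. The abstract does not specify how re-encoded outputs are aggregated or how they are combined with the original conditioning, so everything concrete here is an assumption made for illustration: mean pooling over re-encoded frames, a convex blend controlled by a hypothetical `alpha` (standing in for the adaptation strength), and the toy `toy_encode` / `toy_generate` stand-ins that replace a real pretrained encoder and generator.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def adapt_conditioning(
    encode: Callable[[Vector], Vector],
    generate: Callable[[Vector, Sequence[float]], List[Vector]],
    reference: Vector,
    audio: Sequence[float],
    alpha: float = 0.5,
) -> Tuple[Vector, List[Vector]]:
    """One gradient-free adaptation step (illustrative, not the paper's exact rule)."""
    c_ref = encode(reference)            # static conditioning from the portrait
    draft = generate(c_ref, audio)       # first pass with static conditioning
    # Re-encode the generator's own outputs; mean pooling is an assumption.
    c_feedback = mean_vector([encode(frame) for frame in draft])
    # Assumed convex blend: alpha = 0 keeps the static conditioning unchanged.
    c_new = [(1.0 - alpha) * r + alpha * f for r, f in zip(c_ref, c_feedback)]
    return c_new, generate(c_new, audio)  # second pass with refined conditioning


# Hypothetical toy stand-ins for a pretrained encoder and generator.
def toy_encode(frame: Vector) -> Vector:
    return list(frame)


def toy_generate(cond: Vector, audio: Sequence[float]) -> List[Vector]:
    # Each "frame" is the conditioning shifted by one audio value.
    return [[x + a for x in cond] for a in audio]


toy_audio = [1.0, 3.0]
c_new, frames = adapt_conditioning(
    toy_encode, toy_generate, [0.0, 0.0], toy_audio, alpha=0.5
)
```

In this sketch no model weights are touched: only the conditioning vector changes between the two generator passes, which is one way to read the "parameter-free" and "training-free" claims. Intermediate values of `alpha` trade fidelity to the reference portrait against agreement with the generated sequence, loosely mirroring the bias-variance tradeoff mentioned above.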
