Text-to-video (T2V) diffusion models have advanced rapidly, yet their demographic biases, particularly gender bias, remain largely unexplored. We present FairT2V, a training-free framework that mitigates encoder-induced demographic bias in text-to-video generation without any model finetuning. We first analyze demographic bias in T2V models and show that it primarily originates in pretrained text encoders, which encode implicit gender associations even for neutral prompts. We quantify this effect with a gender-leaning score that correlates with the bias observed in generated videos. Based on this insight, FairT2V neutralizes prompt embeddings via anchor-based spherical geodesic transformations while preserving prompt semantics. To maintain temporal coherence, debiasing is applied only during the early, identity-forming denoising steps through a dynamic denoising schedule. We further propose a video-level fairness evaluation protocol that combines VideoLLM-based reasoning with human verification. Experiments on the modern T2V model Open-Sora show that FairT2V substantially reduces demographic bias across occupations with minimal impact on video quality.
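The anchor-based spherical geodesic transformation can be illustrated as spherical linear interpolation (slerp) between a prompt embedding and a neutral anchor on the unit sphere. The following is a minimal sketch: the anchor construction (e.g., an average of gendered embeddings) and the interpolation strength `t` are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def slerp(v0, v1, t):
    """Spherical linear interpolation: moves v0 toward v1 along the
    geodesic of the unit sphere by fraction t, preserving unit norm."""
    v0 = v0 / np.linalg.norm(v0)
    v1 = v1 / np.linalg.norm(v1)
    dot = np.clip(np.dot(v0, v1), -1.0, 1.0)
    theta = np.arccos(dot)
    if theta < 1e-6:  # nearly parallel: fall back to a linear mix
        return (1 - t) * v0 + t * v1
    return (np.sin((1 - t) * theta) * v0 + np.sin(t * theta) * v1) / np.sin(theta)

# Hypothetical illustration: nudge a prompt embedding toward a
# gender-neutral anchor (stand-in random vectors, not real encoder outputs).
rng = np.random.default_rng(0)
prompt_emb = rng.normal(size=768)   # stand-in for a text-encoder embedding
anchor = rng.normal(size=768)       # stand-in for a neutral anchor embedding
debiased = slerp(prompt_emb, anchor, t=0.3)
```

Because slerp stays on the unit sphere, the debiased embedding keeps the norm the encoder's downstream layers expect, which is the usual motivation for a geodesic rather than a straight-line interpolation.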