Text-to-video generation has recently made impressive progress in producing short, high-quality clips, but evaluating long-form outputs remains a major challenge, especially for complex prompts. Existing benchmarks mostly rely on simplified prompts and focus on low-level metrics, overlooking fine-grained alignment with prompts and abstract dimensions such as narrative coherence and thematic expression. To address these gaps, we propose LoCoT2V-Bench, a benchmark specifically designed for long video generation (LVG) under complex input conditions. Built from a variety of real-world videos, LoCoT2V-Bench introduces a suite of realistic and complex prompts that incorporate elements such as scene transitions and event dynamics. It also provides a multi-dimensional evaluation framework that includes newly proposed metrics such as event-level alignment, fine-grained temporal consistency, content clarity, and the Human Expectation Realization Degree (HERD), which targets more abstract attributes such as narrative flow, emotional response, and character development. Using this framework, we comprehensively evaluate nine representative LVG models and find that, while current methods perform well on basic visual and temporal aspects, they struggle with inter-event consistency, fine-grained alignment, and high-level thematic adherence. Overall, LoCoT2V-Bench provides a comprehensive and reliable platform for evaluating long-form, complex text-to-video generation and highlights critical directions for future method improvement.