Recent advances in text-to-video (T2V) generation, demonstrated by models such as Gen2, Pika, and Sora, have significantly broadened the technology's applicability and popularity. Despite these strides, evaluating T2V models poses substantial challenges. Because of the inherent limitations of automatic metrics, manual evaluation is often considered a superior method for assessing T2V generation. However, existing manual evaluation protocols suffer from reproducibility, reliability, and practicality issues. To address these challenges, this paper introduces the Text-to-Video Human Evaluation (T2VHE) protocol, a comprehensive and standardized protocol for T2V models. The T2VHE protocol includes well-defined metrics, thorough annotator training, and an effective dynamic evaluation module. Experimental results show that the protocol not only ensures high-quality annotations but also reduces evaluation costs by nearly 50\%. We will open-source the entire setup of the T2VHE protocol, including the complete protocol workflow, the details of the dynamic evaluation component, and the annotation interface code, helping communities establish more sophisticated human evaluation protocols.
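To make the cost-saving idea behind a dynamic evaluation module concrete, the following is a minimal sketch, not the paper's actual method: it assumes an Elo-style rating is maintained per model during pairwise annotation, and that comparisons between models whose ratings have already diverged are skipped, so the annotation budget is spent only on informative pairs. All function names and the `skip_gap` threshold here are illustrative assumptions.

```python
import random


def expected_score(r_a, r_b):
    """Elo expected win probability of model A over model B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update(r_a, r_b, outcome, k=32):
    """Update both ratings; outcome is 1.0 (A wins), 0.5 (tie), or 0.0 (B wins)."""
    delta = k * (outcome - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


def dynamic_eval(models, annotate, budget, skip_gap=200):
    """Pairwise human evaluation with dynamic skipping (illustrative only).

    `annotate(a, b)` stands in for a human judgment. Pairs whose current
    rating gap exceeds `skip_gap` are treated as already decided and
    skipped, which is where the annotation cost is saved.
    Returns final ratings and the number of annotations actually used.
    """
    ratings = {m: 1000.0 for m in models}
    pairs = [(a, b) for i, a in enumerate(models) for b in models[i + 1:]]
    used = 0
    while used < budget:
        random.shuffle(pairs)
        progressed = False
        for a, b in pairs:
            if used >= budget:
                break
            if abs(ratings[a] - ratings[b]) >= skip_gap:
                continue  # outcome predictable; save the annotation
            outcome = annotate(a, b)
            ratings[a], ratings[b] = update(ratings[a], ratings[b], outcome)
            used += 1
            progressed = True
        if not progressed:  # every remaining pair is decided; stop early
            break
    return ratings, used
```

For example, with a simulated annotator who always prefers the truly stronger model, the loop stops well before exhausting the budget once all pairwise gaps exceed `skip_gap`, and the final ratings recover the true ordering.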