Recent advancements in text-to-video (T2V) technology, as demonstrated by models such as Gen2, Pika, and Sora, have significantly broadened its applicability and popularity. Despite these strides, evaluating these models poses substantial challenges. In particular, owing to the inherent limitations of automatic metrics, manual evaluation is often considered a superior method for assessing T2V generation. However, existing manual evaluation protocols suffer from reproducibility, reliability, and practicality issues. To address these challenges, this paper introduces the Text-to-Video Human Evaluation (T2VHE) protocol, a comprehensive and standardized protocol for T2V models. The T2VHE protocol includes well-defined metrics, thorough annotator training, and an effective dynamic evaluation module. Experimental results demonstrate that this protocol not only ensures high-quality annotations but can also reduce evaluation costs by nearly 50\%. We will open-source the entire setup of the T2VHE protocol, including the complete protocol workflow, the dynamic evaluation component details, and the annotation interface code. This will help the community establish more sophisticated human assessment protocols.