We propose VC-Inspector, a lightweight, open-source large multimodal model (LMM) for reference-free evaluation of video captions, with a focus on factual accuracy. Unlike existing metrics that suffer from limited context handling, weak factuality assessment, or reliance on proprietary services, VC-Inspector offers a reproducible, fact-aware alternative that aligns closely with human judgments. To enable robust training and interpretable evaluation, we introduce a systematic approach for generating captions with controllable errors, paired with graded quality scores and explanatory annotations. Experiments show that VC-Inspector achieves state-of-the-art correlation with human judgments, generalizing across diverse domains (e.g., VATEX-Eval, Flickr8K-Expert, and Flickr8K-CF benchmarks) and revealing the potential for caption improvement.
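As an illustration of what "correlation with human judgments" means in practice, the following is a minimal sketch of a rank-correlation computation between metric scores and human ratings. It assumes Kendall's tau-b as the statistic, which is one common choice for caption-evaluation benchmarks; the abstract does not state which coefficient is reported. The function name `kendall_tau_b` and the example scores are hypothetical and are not results from the paper.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b between two equal-length score lists (handles ties)."""
    if len(xs) != len(ys):
        raise ValueError("score lists must have the same length")

    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1  # pair tied in the first list
        if dy == 0:
            ties_y += 1  # pair tied in the second list
        if dx * dy > 0:
            concordant += 1  # both lists order the pair the same way
        elif dx * dy < 0:
            discordant += 1  # the lists order the pair oppositely

    n = len(xs)
    total_pairs = n * (n - 1) / 2
    denom = sqrt((total_pairs - ties_x) * (total_pairs - ties_y))
    if denom == 0:
        raise ValueError("tau-b is undefined when one list is constant")
    return (concordant - discordant) / denom


# Hypothetical data: human ratings and metric scores for five captions.
human_ratings = [1, 2, 3, 4, 5]
metric_scores = [0.1, 0.3, 0.2, 0.8, 0.9]
tau = kendall_tau_b(human_ratings, metric_scores)
print(f"Kendall tau-b: {tau:.2f}")
```

In this toy example, nine of the ten caption pairs are ordered the same way by both lists and one pair is swapped, so the coefficient is (9 - 1) / 10 = 0.8. A value near 1 would indicate that the metric ranks captions almost exactly as the human raters do.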