We propose VC-Inspector, a lightweight, open-source large multimodal model (LMM) for reference-free evaluation of video captions, with a focus on factual accuracy. Unlike existing metrics, which suffer from limited context handling, weak factuality assessment, or reliance on proprietary services, VC-Inspector offers a reproducible, fact-aware alternative that aligns closely with human judgments. To enable robust training and interpretable evaluation, we introduce a systematic framework for generating captions with controllable factual errors, paired with graded quality scores and explanatory annotations. Experiments demonstrate that VC-Inspector achieves state-of-the-art correlation with human judgments, generalizes across diverse domains (e.g., the VATEX-Eval, Flickr8K-Expert, and Flickr8K-CF benchmarks), and reveals the potential for caption improvement. The project page is available at https://dipta007.github.io/VC-Inspector