MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction

With the explosive growth of digital entertainment, automated video summarization has become indispensable for applications such as content indexing, personalized recommendation, and efficient media archiving. Automatic synopsis generation for long-form videos, such as movies and TV series, remains a significant challenge for existing Vision-Language Models (VLMs). While proficient at single-image captioning, these general-purpose models often fail in long-duration contexts in two ways: they do not identify characters consistently by ID across scenes, and they produce fragmented narratives. To overcome these limitations, we propose MovieTeller, a novel framework for generating movie synopses via tool-augmented progressive abstraction. Our core contribution is a training-free, tool-augmented, fact-grounded generation process. Instead of requiring costly model fine-tuning, our framework uses off-the-shelf models directly in a plug-and-play manner. We first invoke a specialized face recognition model as an external "tool" to establish Factual Groundings: precise character identities and their corresponding bounding boxes. These groundings are then injected into the prompt to steer the VLM's reasoning, so that the generated scene descriptions are anchored to verifiable facts. Furthermore, our progressive abstraction pipeline decomposes the summarization of a full-length movie into a multi-stage process, which mitigates the context length limitations of current VLMs. Experiments demonstrate that our approach yields significant improvements in factual accuracy, character consistency, and overall narrative coherence compared to end-to-end baselines.
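The two mechanisms described above, injecting tool output into the prompt and summarizing in stages, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `Grounding`, `build_grounded_prompt`, and `progressive_abstraction`, the prompt wording, the `(x1, y1, x2, y2)` box convention, and the fixed `group_size` are all hypothetical choices made for illustration. The abstract states only that the pipeline is multi-stage, so the repeated grouping used here is an assumption. The face recognition model and the VLM are represented by plain inputs and a caller-supplied `summarize` function.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Grounding:
    """One face-recognition result (hypothetical structure)."""
    name: str
    box: Tuple[int, int, int, int]  # assumed (x1, y1, x2, y2) in pixels


def build_grounded_prompt(groundings: Sequence[Grounding]) -> str:
    """Turn tool output into explicit facts placed ahead of the instruction."""
    if groundings:
        facts = "\n".join(f"- {g.name} at box {list(g.box)}" for g in groundings)
    else:
        facts = "- (no known characters detected)"
    return (
        "Known characters in this frame:\n"
        f"{facts}\n"
        "Describe the scene, referring to characters only by the names above."
    )


def chunked(items: Sequence[str], size: int) -> List[List[str]]:
    """Split items into consecutive groups of at most `size` elements."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def progressive_abstraction(
    descriptions: Sequence[str],
    summarize: Callable[[List[str]], str],
    group_size: int = 4,
) -> str:
    """Repeatedly summarize small groups until a single text remains.

    Each call to `summarize` sees at most `group_size` texts, so no single
    call has to fit the whole movie into the model's context window.
    """
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so each pass shrinks the list")
    if not descriptions:
        return ""
    level = list(descriptions)
    while len(level) > 1:
        level = [summarize(group) for group in chunked(level, group_size)]
    return level[0]
```

In use, `summarize` would wrap a call to the chosen VLM or language model, and each scene description passed to `progressive_abstraction` would come from a prompt produced by `build_grounded_prompt`. Grouping consecutive items keeps each intermediate summary tied to a contiguous stretch of the film, which is one plausible way to preserve narrative order across stages.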
