Movies are long-form audiovisual works, yet recommender benchmarks often rely on trailers, thumbnails, or metadata. These sources differ in semantics and scalability: full movies preserve consumption-level evidence, trailers concentrate promotional highlights, and thumbnails provide sparse but catalog-scale visual signals. We present Popcorn, a configurable benchmark for visual evidence in multimodal movie recommendation, combining title-aligned full-movie/trailer embeddings with MovieLens-linked thumbnail features encoded by modern visual and vision-language models. Popcorn standardizes modality assembly, fusion, splitting, evaluation, and LLM-augmented metadata through a single configuration contract. Experiments show that thumbnail VLMs provide strong, scalable item-side evidence, while controlled trailer/full-movie comparisons show that visual evidence sources are not interchangeable: the choice of source and fusion strategy affects ranking accuracy, coverage, diversity, and calibration. The framework is available at https://github.com/RecSys-lab/Popcorn.
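To make the idea of a single configuration contract more concrete, the following is a minimal sketch of how one config object might select a visual evidence source and a fusion strategy. It is not Popcorn's actual API: the names `BenchmarkConfig`, `fuse`, `assemble_items`, the field names, the two fusion options, and the toy vectors are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

Vector = List[float]


@dataclass(frozen=True)
class BenchmarkConfig:
    # Hypothetical fields for illustration; not Popcorn's real schema.
    visual_source: str = "thumbnail"  # "full_movie", "trailer", or "thumbnail"
    fusion: str = "concat"            # "concat" or "mean"


def fuse(text_vec: Vector, visual_vec: Vector, strategy: str) -> Vector:
    """Combine a text vector and a visual vector into one item vector."""
    if strategy == "concat":
        return list(text_vec) + list(visual_vec)
    if strategy == "mean":
        if len(text_vec) != len(visual_vec):
            raise ValueError("mean fusion needs equal-length vectors")
        return [(t + v) / 2.0 for t, v in zip(text_vec, visual_vec)]
    raise ValueError(f"unknown fusion strategy: {strategy}")


def assemble_items(
    config: BenchmarkConfig,
    text: Dict[str, Vector],
    visual_by_source: Dict[str, Dict[str, Vector]],
) -> Dict[str, Vector]:
    """Build fused item vectors for the visual source named in the config."""
    visual = visual_by_source[config.visual_source]
    # Keep only items that have both modalities, so the choice of
    # source changes which items are covered.
    return {
        item: fuse(text[item], visual[item], config.fusion)
        for item in text
        if item in visual
    }


# Toy data (made up): thumbnails exist for both items, a trailer for one.
text = {"m1": [1.0, 0.0], "m2": [0.0, 1.0]}
visual_by_source = {
    "thumbnail": {"m1": [0.5, 0.5], "m2": [1.0, 1.0]},
    "trailer": {"m1": [0.0, 2.0]},
}

thumb_items = assemble_items(BenchmarkConfig(), text, visual_by_source)
trailer_items = assemble_items(
    BenchmarkConfig(visual_source="trailer", fusion="mean"),
    text,
    visual_by_source,
)
```

In this toy setup, switching the source from thumbnails to trailers changes both the fused vectors and the set of items covered, which is the kind of effect a shared configuration makes easy to compare under otherwise identical splitting and evaluation.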