Video highlights detection (VHD) is an active research field in computer vision that aims to locate the most user-appealing clips in raw video inputs. However, most VHD methods rely on the closed-world assumption: a fixed number of highlight categories is defined in advance, and all training data are available beforehand. Consequently, existing methods scale poorly as highlight domains and training data grow. To address these issues, we propose a novel video highlights detection method named Global Prototype Encoding (GPE), which learns incrementally and adapts to new domains via parameterized prototypes. To facilitate this new research direction, we collect a finely annotated dataset termed LiveFood, comprising over 5,100 live gourmet videos that cover four domains: ingredients, cooking, presentation, and eating. To the best of our knowledge, this is the first work to explore video highlights detection in the incremental learning setting, opening up new ground for applying VHD in practical scenarios where both the highlight domains of interest and the training data increase over time. We demonstrate the effectiveness of GPE through extensive experiments. Notably, GPE surpasses popular domain-incremental learning methods on LiveFood, achieving significant mAP improvements on all domains. On the classic datasets, GPE also achieves performance comparable to prior art. The code is available at: https://github.com/ForeverPs/IncrementalVHD_GPE.