We present READMem (Robust Embedding Association for a Diverse Memory), a modular framework for semi-automatic video object segmentation (sVOS) methods designed to handle unconstrained videos. Contemporary sVOS works typically aggregate video frames in an ever-expanding memory, demanding high hardware resources for long-term applications. To mitigate memory requirements and prevent near-duplicate object entries (caused by information from adjacent frames), previous methods introduce a hyper-parameter that controls how frequently frames become eligible for storage. This parameter has to be tuned to the properties of each video (such as the rate of appearance change and the video length) and does not generalize well. Instead, we integrate the embedding of a new frame into the memory only if it increases the diversity of the memory content. Furthermore, we propose a robust association of the embeddings stored in the memory with query embeddings during the update process. Our approach avoids the accumulation of redundant data, which in turn allows us to restrict the memory size and prevent extreme memory demands in long videos. We extend popular sVOS baselines, which previously showed limited performance on long videos, with READMem. Our approach achieves competitive results on the Long-time Video dataset (LV1) while not hindering performance on short sequences. Our code is publicly available.
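The diversity-gated update described above can be illustrated with a minimal sketch. The abstract does not say how diversity is measured, so this example assumes one plausible choice: the determinant of the Gram matrix of the L2-normalised embeddings, which grows as the stored vectors become less alike. The function names `gram_diversity` and `maybe_update_memory`, the flat vector representation of a frame embedding, and the replace-one-slot policy are all hypothetical and chosen for illustration only. They are not taken from the released code.

```python
import numpy as np


def gram_diversity(embeddings: np.ndarray) -> float:
    """Diversity score (assumed measure): determinant of the Gram matrix
    of the L2-normalised rows. Rows must be non-zero vectors."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    gram = normed @ normed.T  # pairwise cosine similarities
    return float(np.linalg.det(gram))


def maybe_update_memory(memory: np.ndarray, candidate: np.ndarray):
    """Try the candidate in every slot of a fixed-size memory.

    Returns (memory, slot). The candidate replaces the slot that yields
    the largest diversity gain; if no replacement increases diversity,
    the memory is returned unchanged and slot is None.
    """
    best_score = gram_diversity(memory)
    best_slot = None
    for slot in range(len(memory)):
        trial = memory.copy()
        trial[slot] = candidate
        score = gram_diversity(trial)
        if score > best_score:  # strict: keep only real improvements
            best_score, best_slot = score, slot
    if best_slot is None:
        return memory, None
    updated = memory.copy()
    updated[best_slot] = candidate
    return updated, best_slot


# Toy memory with two nearly identical embeddings (hypothetical values).
memory = np.array([[1.0, 0.0, 0.0],
                   [1.0, 0.1, 0.0]])

# A dissimilar embedding replaces one of the near-duplicates.
memory, slot = maybe_update_memory(memory, np.array([0.0, 1.0, 0.0]))

# A redundant embedding is rejected; the memory size stays fixed.
memory_after, rejected_slot = maybe_update_memory(memory, np.array([1.0, 0.05, 0.0]))
```

Because the memory never grows, its cost is independent of video length, and no sampling-frequency hyper-parameter is needed: the decision to store a frame depends only on what the memory already contains.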