Continual Text-to-Video Retrieval (CTVR) is a challenging multimodal continual learning setting in which models must incrementally learn new semantic categories while maintaining accurate text-video alignment for previously learned ones, making the setting particularly prone to catastrophic forgetting. A key challenge in CTVR is feature drift, which manifests in two forms: intra-modal feature drift caused by continual learning within each modality, and non-cooperative feature drift across modalities that leads to modality misalignment. To mitigate these issues, we propose StructAlign, a structured cross-modal alignment method for CTVR. First, StructAlign introduces a simplex Equiangular Tight Frame (ETF) geometry as a unified geometric prior to mitigate modality misalignment. Building on this prior, we design a cross-modal ETF alignment loss that aligns text and video features with category-level ETF prototypes, encouraging the learned representations to form an approximate simplex ETF geometry. In addition, to suppress intra-modal feature drift, we design a Cross-modal Relation Preserving loss, which leverages complementary modalities to preserve cross-modal similarity relations, providing stable relational supervision for feature updates. By jointly addressing non-cooperative cross-modal feature drift and intra-modal feature drift, StructAlign effectively alleviates catastrophic forgetting in CTVR. Extensive experiments on benchmark datasets demonstrate that our method consistently outperforms state-of-the-art continual retrieval approaches.
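To make the geometric prior concrete, the sketch below builds a standard simplex ETF (K unit-norm prototypes whose pairwise cosine similarity is exactly -1/(K-1)) and a minimal cosine-based alignment term that pulls text or video features toward their category prototypes. This is an illustrative approximation, not the paper's implementation: the function names, the specific loss form (1 minus mean cosine similarity), and the random orthonormal basis are assumptions for the sketch.

```python
import numpy as np

def simplex_etf(num_classes: int, dim: int, seed: int = 0) -> np.ndarray:
    """Build a dim x K matrix whose K columns form a simplex ETF:
    unit-norm prototypes with pairwise inner product -1/(K-1)."""
    assert dim >= num_classes, "an exact simplex ETF needs dim >= K"
    rng = np.random.default_rng(seed)
    # Random dim x K matrix with orthonormal columns (reduced QR).
    U, _ = np.linalg.qr(rng.standard_normal((dim, num_classes)))
    K = num_classes
    # Center the K orthonormal directions and rescale: the resulting
    # columns are automatically unit-norm and maximally equiangular.
    return np.sqrt(K / (K - 1)) * U @ (np.eye(K) - np.ones((K, K)) / K)

def etf_alignment_loss(feats: np.ndarray, labels: np.ndarray,
                       prototypes: np.ndarray) -> float:
    """Hypothetical alignment term: 1 - mean cosine similarity between
    each (text or video) feature and its category's ETF prototype."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    cos = np.sum(f * prototypes[:, labels].T, axis=1)
    return float(1.0 - cos.mean())

K, d = 5, 16
E = simplex_etf(K, d)
G = E.T @ E  # Gram matrix: 1 on the diagonal, -1/(K-1) off-diagonal

# Toy usage: random "features" from both modalities share the same
# prototype targets, so both are pulled toward a common geometry.
rng = np.random.default_rng(1)
text_feats = rng.standard_normal((8, d))
labels = rng.integers(0, K, size=8)
loss = etf_alignment_loss(text_feats, labels, E)
```

Because the prototypes are fixed in advance, both modalities are aligned to the same category-level anchors, which is what lets a shared geometric prior counteract non-cooperative cross-modal drift.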