Temporal Sentence Grounding in Video (TSGV) suffers from dataset bias, which arises from the uneven temporal distribution of target moments among samples that share similar semantic components in their input videos or query texts. Existing methods use prior knowledge about the bias to artificially break this uneven distribution, which removes only a limited number of salient language biases. In this work, we propose the bias-conflict sample synthesis and adversarial removal debias strategy (BSSARD), which dynamically generates bias-conflict samples by explicitly leveraging potentially spurious correlations between single-modality features and the temporal position of the target moments. Through adversarial training, its bias generators continuously introduce biases and generate bias-conflict samples to deceive its grounding model. Meanwhile, the grounding model continuously eliminates the introduced biases, which requires it to model multi-modality alignment information. BSSARD covers most kinds of coupling relationships and disrupts language and visual biases simultaneously. Extensive experiments on Charades-CD and ActivityNet-CD demonstrate the promising debiasing capability of BSSARD. Source code is available at https://github.com/qzhb/BSSARD.