DENOISER: Rethinking the Robustness for Open-Vocabulary Action Recognition

As one of the fundamental video tasks in computer vision, Open-Vocabulary Action Recognition (OVAR) has recently gained increasing attention with the development of vision-language pre-training. To generalize to arbitrary classes, existing methods treat class labels as text descriptions and formulate OVAR as evaluating the embedding similarity between visual samples and textual classes. However, one crucial issue has been overlooked: the class descriptions given by users may be noisy, e.g., contain misspellings and typos, which limits the real-world practicality of vanilla OVAR. To fill this research gap, this paper is the first to evaluate existing methods by simulating multi-level noise of various types, and it reveals their poor robustness. To tackle the noisy OVAR task, we further propose a novel DENOISER framework consisting of two parts: generation and discrimination. Concretely, the generative part denoises noisy class-text names via a decoding process, i.e., it proposes text candidates and then uses inter-modal and intra-modal information to vote for the best one. In the discriminative part, we use vanilla OVAR models to assign visual samples to class-text names, thus obtaining more semantics. For optimization, we alternate between the generative and discriminative parts for progressive refinement. The denoised text classes help OVAR models classify visual samples more accurately; in return, the classified visual samples help denoise the text better. We carry out extensive experiments on three datasets to show the superior robustness of our approach, along with thorough ablations to dissect the effectiveness of each component.
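
To make the idea of simulating noisy class names concrete, the following is a minimal sketch of character-level noise injection. It is an illustration only: the function name `add_text_noise`, the three edit operations (substitute, delete, insert), and the per-letter `noise_rate` parameter are assumptions made here, since the abstract does not specify the exact noise types or levels used in the paper.

```python
import random
import string


def add_text_noise(label: str, noise_rate: float, seed: int = 0) -> str:
    """Corrupt a class name with random character-level typos.

    Hypothetical helper, not the paper's protocol: each letter is,
    with probability `noise_rate`, substituted, deleted, or followed
    by an inserted letter (operation chosen uniformly at random).
    Non-letter characters such as spaces are always kept.
    """
    rng = random.Random(seed)  # local generator for reproducibility
    out = []
    for ch in label:
        if not ch.isalpha() or rng.random() >= noise_rate:
            out.append(ch)  # keep spaces and unperturbed letters
            continue
        op = rng.choice(("substitute", "delete", "insert"))
        if op == "substitute":
            out.append(rng.choice(string.ascii_lowercase))
        elif op == "insert":
            out.append(ch)
            out.append(rng.choice(string.ascii_lowercase))
        # "delete": append nothing, so the letter is dropped
    return "".join(out)


if __name__ == "__main__":
    # Higher rates stand in for higher noise levels.
    for rate in (0.0, 0.1, 0.3):
        print(rate, add_text_noise("playing guitar", rate, seed=1))
```

Sweeping `noise_rate` over several values gives one simple way to mimic multi-level noise; feeding the corrupted names to an OVAR model in place of the clean ones would then expose how quickly accuracy degrades.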
