Video recognition models are typically trained on fixed taxonomies that are often too coarse, collapsing distinctions in object, manner, or outcome under a single label. As tasks and definitions evolve, such models cannot accommodate emerging distinctions, and collecting new annotations and retraining to reflect these changes is costly. To address these challenges, we introduce category splitting, a new task in which an existing classifier is edited to refine a coarse category into finer subcategories while preserving accuracy elsewhere. We propose a zero-shot editing method that leverages the latent compositional structure of video classifiers to expose fine-grained distinctions without additional data. We further show that low-shot fine-tuning, while simple, is highly effective and benefits from our zero-shot initialization. Experiments on our new video benchmarks for category splitting demonstrate that our method substantially outperforms vision-language baselines, improving accuracy on the newly split categories without sacrificing performance on the rest. Project page: https://kaitingliu.github.io/Category-Splitting/.