TokenCLIP: Token-wise Prompt Learning for Zero-shot Anomaly Detection

Adapting CLIP to zero-shot anomaly detection on unseen objects has shown strong potential. However, existing methods typically rely on a single textual space to align with visual semantics across diverse objects and domains. This indiscriminate alignment hinders the model from accurately capturing varied anomaly semantics. We propose TokenCLIP, a token-wise adaptation framework that enables dynamic alignment between visual and learnable textual spaces for fine-grained anomaly learning. Rather than mapping all visual tokens to a single, token-agnostic textual space, TokenCLIP aligns each token with a customized textual subspace that represents its visual characteristics. Explicitly assigning a unique learnable textual space to each token is computationally intractable and prone to insufficient optimization. We instead expand the token-agnostic textual space into a set of orthogonal subspaces and dynamically assign each token to a combination of subspaces guided by semantic affinity, which supports token-wise adaptation that is both customized and efficient. To this end, we formulate dynamic alignment as an optimal transport (OT) problem, in which all visual tokens in an image are transported to textual subspaces based on semantic similarity. The OT transport constraints ensure sufficient optimization across subspaces and encourage them to focus on different semantics. Solving the problem yields a transport plan that adaptively assigns each token to semantically relevant subspaces. Top-k masking is then applied to sparsify the plan and specialize subspaces for distinct visual regions. Extensive experiments demonstrate the superiority of TokenCLIP.
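
The assignment step described above (transport tokens to subspaces by semantic similarity, then sparsify with top-k masking) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several choices are assumptions the abstract does not specify: each textual subspace is represented by a single embedding vector (`subspace_protos`), the cost is one minus cosine similarity, both marginals are uniform, the OT problem is solved with entropy-regularized Sinkhorn iterations, and the masked rows are renormalized to sum to 1. All function and variable names are hypothetical.

```python
import numpy as np


def sinkhorn_plan(cost, row_marginal, col_marginal, eps=0.1, n_iters=200):
    """Entropy-regularized OT solved with Sinkhorn iterations (assumed solver)."""
    kernel = np.exp(-cost / eps)
    u = np.ones_like(row_marginal)
    v = np.ones_like(col_marginal)
    for _ in range(n_iters):
        u = row_marginal / (kernel @ v)      # rescale rows toward the token marginal
        v = col_marginal / (kernel.T @ u)    # rescale columns toward the subspace marginal
    return u[:, None] * kernel * v[None, :]


def topk_mask(plan, k):
    """Keep the k largest entries in each row, zero the rest, renormalize rows."""
    top_idx = np.argsort(plan, axis=1)[:, -k:]
    mask = np.zeros_like(plan)
    np.put_along_axis(mask, top_idx, 1.0, axis=1)
    sparse = plan * mask
    return sparse / sparse.sum(axis=1, keepdims=True)  # renormalization is an assumption


def assign_tokens(tokens, subspace_protos, k=2, eps=0.1):
    """Return the dense transport plan and the sparse per-token subspace weights."""
    t = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    s = subspace_protos / np.linalg.norm(subspace_protos, axis=1, keepdims=True)
    cost = 1.0 - t @ s.T                     # assumed cost: 1 - cosine similarity
    n_tokens, n_subspaces = cost.shape
    plan = sinkhorn_plan(
        cost,
        np.full(n_tokens, 1.0 / n_tokens),        # uniform token marginal (assumption)
        np.full(n_subspaces, 1.0 / n_subspaces),  # uniform subspace marginal (assumption)
        eps=eps,
    )
    return plan, topk_mask(plan, k)


# Toy data: 16 visual tokens and 4 textual subspaces in an 8-dimensional space.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))
subspace_protos = rng.normal(size=(4, 8))
plan, weights = assign_tokens(tokens, subspace_protos, k=2)
```

In this sketch the uniform column marginal plays the role of the transport constraint: every subspace must receive an equal share of the total mass, so none is left unused. The top-k mask then limits each token to its `k` strongest subspaces.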
