Semantic IDs represent items as shared discrete token sequences and have become a practical tool for recommendation and retrieval. Yet it remains difficult to tell why a tokenizer fails: poor quality may come from codebook underutilization, unstable decision boundaries, or geometric distortion of the embedding space. This paper develops a quantitative framework for diagnosing these failures through expected codeword overlap and effective codebook capacity. The former measures expected codeword confusion under retrieval-time perturbation, while the latter converts that confusion into an effective number of usable, well-separated codes. The framework links semantic boundary confusion to both code usage imbalance and Euclidean geometric constraints. As a proof of concept, we present Decoupled Residual Quantization (DRQ), which separates continuous geometry reconstruction from discrete distribution matching. Experiments on a large-scale industrial dataset show that Semantic ID quality is multi-objective: symbolic robustness, reconstruction fidelity, and behavior-aware soft matching each stress different aspects of a tokenizer. These downstream observations are based on one proprietary industrial dataset, so they should be read as a case study rather than a universal benchmark claim.
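To make the two diagnostics concrete, the following is a minimal sketch of simple proxies, not the definitions used in this paper. It assumes a single-level codebook with nearest-neighbor assignment in Euclidean space, isotropic Gaussian noise as a stand-in for retrieval-time perturbation, and the perplexity of code usage as a stand-in for an effective number of codes. All function names and parameters (`assign`, `flip_rate`, `usage_perplexity`, `sigma`, `n_draws`) are hypothetical.

```python
import numpy as np


def assign(x, codebook):
    """Map each row of x to the index of its nearest codeword (Euclidean)."""
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)


def flip_rate(x, codebook, sigma, n_draws=8, seed=0):
    """Monte Carlo estimate of how often Gaussian noise changes the assigned code.

    Assumption: isotropic noise with standard deviation sigma stands in for
    retrieval-time perturbation; this is a proxy for codeword confusion.
    """
    rng = np.random.default_rng(seed)
    base = assign(x, codebook)
    total = 0.0
    for _ in range(n_draws):
        noisy = x + sigma * rng.standard_normal(x.shape)
        total += np.mean(assign(noisy, codebook) != base)
    return total / n_draws


def usage_perplexity(codes, num_codes):
    """exp(entropy) of the empirical code usage: a usage-only effective code count.

    Assumption: this ignores separation between codes, so it reflects only the
    underutilization side of the picture.
    """
    counts = np.bincount(codes, minlength=num_codes)
    p = counts / counts.sum()
    nz = p[p > 0]
    return float(np.exp(-(nz * np.log(nz)).sum()))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    items = rng.standard_normal((1000, 16))      # synthetic item embeddings
    codebook = rng.standard_normal((32, 16))     # synthetic codebook
    codes = assign(items, codebook)
    print("flip rate:", flip_rate(items, codebook, sigma=0.1))
    print("usage perplexity:", usage_perplexity(codes, 32))
```

Under these assumptions, a high flip rate points to unstable decision boundaries, while a usage perplexity far below the codebook size points to underutilization. Reading the two together is what separates the failure modes.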