SIDInspector: A Mapping-First Diagnostic Resource for Semantic-ID Tokenizers

Semantic-ID (\sid) tokenizers are increasingly reused as standalone artifacts in generative recommendation: an exported item-to-code mapping becomes the address space that a later sequence generator must use. These mappings rarely come with a common inspection interface, so coverage gaps, full-code aliasing, behaviorally weak prefixes, tail compression, and prefix fan-out are often discovered only after downstream training. We present \tool, a mapping-first diagnostic resource for \sid tokenizer artifacts. \tool defines a small adapter contract over item mappings, metadata, interactions, and optional generator traces; validates the contract; and reports mapping-level probes for utilization, aliasing, neighborhood alignment, popularity allocation, and structural cost, with hooks for temporal churn and generator traces. \tool reports inspectable artifact profiles before downstream leaderboard scores. The released resource covers four tokenizer artifact lines: a contrast between a GRID/RQ-KMeans-style export and a ReSID/GAOQ export over the same 23,742 Musical items, plus released LETTER and LC-Rec item-index artifacts. In the Musical contrast, the GRID-style feature-text export has 3,749 unique full codes and a full-code aliasing rate of 0.977, while the exported ReSID/GAOQ mapping is aliasing-free. Yet the strongest prefix--co-occurrence alignment comes from a deterministic category-prefix control, not from either learned export (0.447 versus 0.154 and 0.055--0.080), showing that addressability and behaviorally meaningful prefixes should be inspected separately. Cross-domain, fixed-reranker, and mechanism-probe checks support the same diagnostic direction: prefix alignment is a candidate-exposure signal, while final ranking quality remains a downstream model question.
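To make the mapping-level probes concrete, the following is a minimal sketch of how coverage, unique full codes, full-code aliasing, and prefix fan-out could be computed from an exported item-to-code mapping. It is not \tool's actual API: the function name `mapping_profile`, its arguments, and the returned keys are hypothetical. The abstract does not define the aliasing rate, so the sketch assumes it is the fraction of mapped items whose full code is shared with at least one other item.

```python
from collections import Counter, defaultdict


def mapping_profile(item_to_code, catalog=None):
    """Summarize an item -> full-code mapping (hypothetical helper).

    item_to_code: dict mapping item id -> tuple of ints (the full code).
    catalog: optional iterable of all item ids; defaults to the mapped items.
    """
    catalog = set(catalog) if catalog is not None else set(item_to_code)

    # Count how many catalog items land on each full code.
    code_counts = Counter(
        item_to_code[item] for item in catalog if item in item_to_code
    )
    mapped = sum(code_counts.values())

    # ASSUMPTION: an item is "aliased" if its full code is shared
    # with at least one other item.
    aliased = sum(n for n in code_counts.values() if n > 1)

    # Prefix fan-out: number of distinct next tokens under each prefix.
    children = defaultdict(set)
    for code in code_counts:
        for depth in range(1, len(code)):
            children[code[:depth]].add(code[depth])

    return {
        "coverage": mapped / len(catalog) if catalog else 0.0,
        "unique_full_codes": len(code_counts),
        "aliasing_rate": aliased / mapped if mapped else 0.0,
        "max_prefix_fanout": max(
            (len(kids) for kids in children.values()), default=0
        ),
    }


# Toy mapping: items "a" and "b" collide; item "e" has no code at all.
toy_mapping = {"a": (1, 2), "b": (1, 2), "c": (1, 3), "d": (2, 0)}
toy_profile = mapping_profile(toy_mapping, catalog=["a", "b", "c", "d", "e"])
```

Keeping coverage and aliasing as separate fields mirrors the distinction drawn above: a mapping can address every item without collisions and still have prefixes that carry little behavioral signal, which a separate alignment probe would have to measure.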
