The open-weight LLM ecosystem is increasingly defined by model composition techniques (such as weight merging, speculative decoding, and vocabulary expansion) that remix capabilities from diverse sources. A critical prerequisite for applying these methods across different model families is tokenizer transplant, which aligns incompatible vocabularies to a shared embedding space. We demonstrate that this essential interoperability step introduces a supply-chain vulnerability: we engineer a single "breaker token" that is functionally inert in a donor model yet reliably reconstructs into a high-salience malicious feature after transplant into a base model. By exploiting the geometry of coefficient reuse, our attack creates an asymmetric realizability gap that sabotages the base model's generation while leaving the donor's utility statistically indistinguishable from nominal behavior. We formalize this as a dual-objective optimization problem and instantiate the attack with a sparse solver. Empirically, the attack is training-free, achieves spectral mimicry to evade outlier detection, and exhibits structural persistence against fine-tuning and weight merging, highlighting a hidden risk in the pipeline of modular AI composition. Code is available at https://github.com/xz-liu/tokenforge.
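To make the coefficient-reuse step concrete, below is a minimal sketch of how a donor-only token is typically carried over during tokenizer transplant: its donor-space embedding is decomposed into a sparse combination of anchor tokens shared by both vocabularies, and the same coefficients are then reused against the base model's anchor embeddings. The names (`donor_anchors`, `base_anchors`, `transplant_token`) and the top-k least-squares fit are illustrative assumptions standing in for the paper's unspecified sparse solver, not its actual implementation; the final reuse step is where the abstract's asymmetric realizability gap arises.

```python
# Illustrative sketch of coefficient-reuse tokenizer transplant (assumptions noted above).
import numpy as np

def transplant_token(donor_vec, donor_anchors, base_anchors, k=16):
    """Reconstruct one donor-only token in the base model's embedding space.

    donor_vec:     (d_donor,)   embedding of the token in the donor model
    donor_anchors: (n, d_donor) donor embeddings of anchor tokens shared by both vocabularies
    base_anchors:  (n, d_base)  base embeddings of the same anchor tokens
    """
    # 1. Enforce sparsity: keep only the k anchors most similar to the token in donor space.
    sims = donor_anchors @ donor_vec / (
        np.linalg.norm(donor_anchors, axis=1) * np.linalg.norm(donor_vec) + 1e-8)
    idx = np.argsort(-sims)[:k]

    # 2. Solve for coefficients c such that donor_anchors[idx].T @ c ≈ donor_vec.
    coeffs, *_ = np.linalg.lstsq(donor_anchors[idx].T, donor_vec, rcond=None)

    # 3. Coefficient reuse: apply the same coefficients to the base model's anchors.
    #    A token engineered to look inert in the donor space can, at this step,
    #    reconstruct into a very different (high-salience) direction in the base space.
    return base_anchors[idx].T @ coeffs
```

In this picture, the attacker only controls the donor-side embedding of the breaker token; because the base model's anchor geometry differs from the donor's, the reused coefficients can realize a malicious feature in the base space while remaining benign in the donor.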