We present a text-reconstruction attack on mixture-of-experts (MoE) language models that recovers tokens from expert selections alone. In MoE models, each token is routed to a subset of expert subnetworks; we show these routing decisions leak substantially more information than previously understood. Prior work using logistic regression achieves limited reconstruction; we show that a 3-layer MLP improves this to 63.1% top-1 accuracy, and that a transformer-based sequence decoder recovers 91.2% of tokens top-1 (94.8% top-10) on 32-token sequences from OpenWebText after training on 100M tokens. These results connect MoE routing to the broader literature on embedding inversion. We outline practical leakage scenarios (e.g., distributed inference and side channels) and show that adding noise reduces but does not eliminate reconstruction. Our findings suggest that expert selections in MoE deployments should be treated as being as sensitive as the underlying text.
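To illustrate why per-token expert selections can identify tokens, here is a minimal self-contained sketch, not the paper's method: it uses a random linear router over random token embeddings (both assumptions for illustration), observes only the multi-hot top-2 expert pattern for each token, and trains a softmax classifier to map patterns back to token IDs. Because routing is deterministic per token, even this simple attacker recovers far more than chance.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EXPERTS, TOP_K, DIM = 50, 32, 2, 64

# Toy stand-ins (assumptions, not the paper's setup):
# random token embeddings and a random linear router gate.
emb = rng.normal(size=(VOCAB, DIM))
gate = rng.normal(size=(DIM, EXPERTS))

def route(token_ids):
    """Return multi-hot top-k expert selections for each token."""
    logits = emb[token_ids] @ gate                 # (N, EXPERTS)
    topk = np.argsort(-logits, axis=1)[:, :TOP_K]  # top-2 experts per token
    sel = np.zeros((len(token_ids), EXPERTS))
    np.put_along_axis(sel, topk, 1.0, axis=1)
    return sel

# Attacker's training data: observed expert selections with known tokens.
tokens = rng.integers(0, VOCAB, size=4000)
X = route(tokens)

# Multinomial logistic regression trained by full-batch gradient descent.
W = np.zeros((EXPERTS, VOCAB))
b = np.zeros(VOCAB)
for _ in range(500):
    z = X @ W + b
    z -= z.max(axis=1, keepdims=True)              # stable softmax
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(tokens)), tokens] -= 1.0       # dL/dz for cross-entropy
    W -= 0.5 * (X.T @ p) / len(tokens)
    b -= 0.5 * p.mean(axis=0)

# Evaluate top-1 token recovery on fresh tokens: routing is deterministic,
# so any token whose expert pattern is unique in the vocab is recoverable.
test = rng.integers(0, VOCAB, size=1000)
pred = (route(test) @ W + b).argmax(axis=1)
acc = (pred == test).mean()
print(f"top-1 recovery from expert selections: {acc:.2f}")
```

The only failure mode in this toy is pattern collisions (two tokens routed to the same expert pair); with 32 experts and top-2 there are 496 possible patterns for 50 tokens, so most tokens are uniquely identified. The paper's stronger results come from sequence-level decoding across layers, which this per-token sketch does not attempt.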