We present a text-reconstruction attack on mixture-of-experts (MoE) language models that recovers tokens from expert selections alone. In MoE models, each token is routed to a subset of expert subnetworks; we show these routing decisions leak substantially more information than previously understood. Prior work using logistic regression achieves only limited reconstruction; we show that a 3-layer MLP improves this to 63.1% top-1 accuracy, and that a transformer-based sequence decoder recovers 91.2% of tokens top-1 (94.8% top-10) on 32-token sequences from OpenWebText after training on 100M tokens. These results connect MoE routing to the broader literature on embedding inversion. We outline practical leakage scenarios (e.g., distributed inference and side channels) and show that adding noise reduces but does not eliminate reconstruction. Our findings suggest that expert selections in MoE deployments should be treated as being as sensitive as the underlying text.
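To illustrate why routing decisions can leak token identity, the following is a minimal NumPy sketch of a toy setting (all dimensions, the random router, and the nearest-signature attacker are illustrative assumptions, not the paper's architecture): because a top-k router maps each token to a near-unique multi-hot expert signature, an observer who sees only those signatures can recover the tokens by matching against a signature table.

```python
import numpy as np

rng = np.random.default_rng(0)
V, L, E, K = 1000, 8, 16, 2  # vocab size, layers, experts per layer, top-k (all toy values)

# Hypothetical router: each token's embedding determines its top-K experts per layer.
emb = rng.normal(size=(V, 32))
router_w = rng.normal(size=(L, 32, E))
logits = np.einsum('vd,lde->vle', emb, router_w)   # (V, L, E) routing scores
topk = np.argsort(-logits, axis=2)[:, :, :K]       # chosen expert ids per layer

# Multi-hot routing signature per token: which experts it activates, flattened over layers.
sig = np.zeros((V, L * E))
for l in range(L):
    for k in range(K):
        sig[np.arange(V), l * E + topk[:, l, k]] = 1.0

# Attacker side: given only the observed signatures, reconstruct tokens by
# scoring every vocabulary entry's signature and taking the best match.
tokens = rng.integers(0, V, size=64)               # the "secret" token sequence
observed = sig[tokens]                             # what the side channel leaks
recovered = (observed @ sig.T).argmax(axis=1)
print("reconstruction accuracy:", (recovered == tokens).mean())
```

In this deterministic toy setting the signatures are essentially unique (120^8 possible per-token signatures versus 1000 tokens), so reconstruction is near-perfect; the paper's learned MLP and sequence decoders address the realistic case where signatures are context-dependent and noisy.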