EPM-RL: Reinforcement Learning for On-Premise Product Mapping in E-Commerce

Product mapping, the task of deciding whether two e-commerce listings refer to the same product, is a core problem for price monitoring and channel visibility. In real marketplaces, however, sellers frequently inject promotional keywords, platform-specific tags, and bundle descriptions into titles, causing the same product to appear under many different names. Recent LLM-based and multi-agent frameworks improve robustness and interpretability on such hard cases, but they often rely on expensive external APIs, repeated retrieval, and complex inference-time orchestration, making large-scale deployment costly and difficult in privacy-sensitive enterprise settings. To address these issues, we present EPM-RL, a reinforcement-learning-based framework for building an accurate and efficient on-premise e-commerce product mapping model. Our central idea is to distill high-cost agentic reasoning into a trainable in-house model. Starting from a curated set of product pairs with LLM-generated rationales and human verification, we first perform parameter-efficient fine-tuning (PEFT) on a small student model using structured reasoning outputs. We then further optimize the model with reinforcement learning (RL) using an agent-based reward that jointly evaluates output-format compliance, label correctness, and reasoning-preference scores from specially designed judge models. Preliminary results show that EPM-RL consistently improves over PEFT-only training and offers a stronger quality-cost trade-off than commercial API-based baselines, while enabling private deployment and lower operational cost. These findings suggest that reinforcement learning can turn product mapping from a high-latency agentic pipeline into a scalable, inspectable, and production-ready in-house system.
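
As a rough illustration of how a reward that jointly scores format compliance, label correctness, and a judge-assigned reasoning score might be composed, the following Python sketch combines the three terms as a weighted sum. Everything specific here is an assumption made for illustration and is not taken from the paper: the JSON output schema with `reasoning` and `label` fields, the label names `match` and `mismatch`, the weights, the function name `composite_reward`, and the `judge` callable that stands in for the judge models.

```python
import json
from typing import Callable

# Hypothetical weights: the abstract does not say how the three terms are combined.
W_FORMAT, W_LABEL, W_JUDGE = 0.2, 0.5, 0.3

# Hypothetical label vocabulary for the structured output.
VALID_LABELS = ("match", "mismatch")


def composite_reward(
    output: str,
    gold_label: str,
    judge: Callable[[str], float],
) -> float:
    """Score one model output in [0, 1] from format, label, and judge terms."""
    # Format compliance: the output must be a JSON object with the assumed fields.
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    reasoning = parsed.get("reasoning")
    label = parsed.get("label")
    if not isinstance(reasoning, str) or label not in VALID_LABELS:
        return 0.0

    # Label correctness: exact agreement with the verified label.
    label_score = 1.0 if label == gold_label else 0.0

    # Reasoning preference: the judge returns a score, clamped here to [0, 1].
    judge_score = min(1.0, max(0.0, judge(reasoning)))

    return W_FORMAT + W_LABEL * label_score + W_JUDGE * judge_score
```

In this sketch a malformed output earns zero reward regardless of its label, so the format term acts as a gate; whether EPM-RL gates or simply adds the format term is not stated in the abstract.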
