Encoder-only transformers remain indispensable in retrieval, classification, and ranking systems where latency, stability, and cost are paramount. Most general-purpose encoders, however, are trained on generic corpora with limited coverage of specialized domains. We introduce RexBERT, a family of BERT-style encoders designed specifically for e-commerce semantics. We make three contributions. First, we release Ecom-niverse, a 350-billion-token corpus curated from diverse retail and shopping sources. We describe a modular pipeline that isolates and extracts e-commerce content from FineFineWeb and other open web resources, and we characterize the resulting domain distribution. Second, we present a reproducible pretraining recipe building on ModernBERT's architectural advances. The recipe consists of three phases: general pretraining, context extension, and annealed domain specialization. Third, we train RexBERT models ranging from 17M to 400M parameters and evaluate them on token classification, semantic similarity, and general natural language understanding tasks using e-commerce datasets. Despite having 2-3x fewer parameters, RexBERT outperforms larger general-purpose encoders and matches or surpasses modern long-context models on domain-specific benchmarks. Our results demonstrate that high-quality in-domain data combined with a principled training approach provides a stronger foundation for e-commerce applications than indiscriminate scaling alone.