Encoder-only transformers remain indispensable in retrieval, classification, and ranking systems where latency, stability, and cost are paramount. Most general-purpose encoders, however, are trained on generic corpora with limited coverage of specialized domains. We introduce RexBERT, a family of BERT-style encoders designed specifically for e-commerce semantics. We make three contributions. First, we release Ecom-niverse, a 350-billion-token corpus curated from diverse retail and shopping sources. We describe a modular pipeline that isolates and extracts e-commerce content from FineFineWeb and other open web resources, and we characterize the resulting domain distribution. Second, we present a reproducible pretraining recipe building on ModernBERT's architectural advances. The recipe consists of three phases: general pretraining, context extension, and annealed domain specialization. Third, we train RexBERT models ranging from 17M to 400M parameters and evaluate them on token classification, semantic similarity, and general natural language understanding tasks using e-commerce datasets. Despite having 2-3x fewer parameters, RexBERT outperforms larger general-purpose encoders and matches or surpasses modern long-context models on domain-specific benchmarks. Our results demonstrate that high-quality in-domain data combined with a principled training approach provides a stronger foundation for e-commerce applications than indiscriminate scaling alone.