UniScale: Synergistic Entire Space Data and Model Scaling for Search Ranking

Recent advances in Large Language Models (LLMs) have inspired a surge of scaling-law research in industrial search, advertising, and recommendation systems. However, existing approaches focus mainly on architectural improvements and overlook the critical synergy between data and architecture design. We observe that scaling model parameters alone yields diminishing returns, i.e., the marginal performance gain steadily declines as model size increases, and that the performance degradation caused by complex heterogeneous data distributions often cannot be recovered through model design alone. To address these limitations, we propose UniScale, a novel co-design framework that jointly optimizes data and architecture to unlock the full potential of model scaling. UniScale has two core parts: (1) ES$^3$ (Entire-Space Sample System), a high-quality data scaling system that expands the training signal beyond conventional sampling strategies in two ways: with intra-domain request contexts whose global supervision signal is constructed by hierarchical label attribution, and with cross-domain samples that align with the essence of user decisions under similar content-exposure environments in the search domain; and (2) HHSFT (Heterogeneous Hierarchical Sample Fusion Transformer), a novel architecture that models the complex heterogeneous distribution of the scaled data and harnesses entire-space user behavior data through Heterogeneous Hierarchical Feature Interaction and Entire Space User Interest Fusion, thereby surpassing the performance ceiling of structure-only model tuning. Extensive experiments on a large-scale real-world e-commerce search platform demonstrate that UniScale achieves significant improvements through the synergistic co-design of data and architecture and exhibits clear scaling trends, delivering substantial gains in key business metrics.
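
The diminishing-returns observation above can be made concrete with a small check: compute the metric gain per doubling of model size between consecutive measurements and test whether those gains shrink. The following is a minimal sketch, not the paper's evaluation procedure. The function names `marginal_gains` and `has_diminishing_returns` are hypothetical, and the size/score values are synthetic, generated from an assumed saturating power law purely for illustration; they are not results from UniScale.

```python
import math
from typing import List, Sequence


def marginal_gains(sizes: Sequence[float], scores: Sequence[float]) -> List[float]:
    """Return the score gain per doubling of model size between consecutive points."""
    if len(sizes) != len(scores) or len(sizes) < 2:
        raise ValueError("need at least two (size, score) pairs of equal length")
    gains = []
    for i in range(1, len(sizes)):
        doublings = math.log2(sizes[i] / sizes[i - 1])
        if doublings <= 0:
            raise ValueError("sizes must be strictly increasing")
        gains.append((scores[i] - scores[i - 1]) / doublings)
    return gains


def has_diminishing_returns(gains: Sequence[float]) -> bool:
    """True if every marginal gain is strictly smaller than the one before it."""
    return all(later < earlier for earlier, later in zip(gains, gains[1:]))


# Synthetic, hypothetical curve (NOT data from the paper):
# score(N) = 0.80 - 0.05 * (N / 1e6) ** -0.3, a saturating power law.
sizes = [1e6, 2e6, 4e6, 8e6, 1.6e7]
scores = [0.80 - 0.05 * (n / 1e6) ** -0.3 for n in sizes]

gains = marginal_gains(sizes, scores)
print(gains)
print(has_diminishing_returns(gains))
```

Measuring gain per doubling rather than per added parameter keeps the comparison fair when model sizes are spaced geometrically, which is the usual layout in scaling studies.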
