In search and recommendation systems, predictive models often suffer from temporal instability when certain input features introduce volatility in output scores. This instability can degrade model reliability and user experience, especially in multi-stage systems where consistent predictions are critical for downstream decision-making. We introduce Fortress, a general framework for enhancing model stability and accuracy by identifying and pruning features that contribute to inconsistent prediction scores over time. Fortress leverages historical snapshots (temporally partitioned datasets that capture score fluctuations for the same entity across periods) and follows a four-step process: (1) collect historical snapshots, (2) identify samples with unstable predictions, (3) isolate and remove instability-inducing features, and (4) retrain models using only stable features. While semantic features from LLMs and BERT-based models improve generalization, they often lack full query or entity coverage. Engagement-based features offer strong predictive power but tend to introduce temporal instability. Fortress mitigates this trade-off by suppressing the volatility of engagement signals while retaining their predictive value, leading to more stable and accurate models. We validate Fortress on a query-to-app relevance model in a large-scale app marketplace. Offline experiments demonstrate notable improvements in prediction stability (measured by Coefficient of Variation) and classification performance (measured by PR-AUC).
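As a rough illustration of steps (2) and (3), the sketch below computes a per-entity Coefficient of Variation (standard deviation divided by mean) over scores from several snapshots, flags entities whose scores fluctuate, and then marks features that are themselves volatile on those entities. The abstract does not specify how Fortress isolates features, so the per-feature volatility heuristic, the function names, the array layouts, and the 0.2 threshold are all assumptions made for this example rather than the authors' method.

```python
import numpy as np


def coefficient_of_variation(values: np.ndarray) -> np.ndarray:
    """CoV (std / |mean|) along axis 0, the snapshot axis.

    Entries whose mean is exactly 0 get a CoV of 0 to avoid dividing by zero.
    """
    mean = values.mean(axis=0)
    std = values.std(axis=0)
    return np.divide(std, np.abs(mean), out=np.zeros_like(std), where=mean != 0)


def unstable_entities(scores: np.ndarray, threshold: float = 0.2) -> np.ndarray:
    """Indices of entities whose score CoV exceeds the (hypothetical) threshold.

    scores has shape (n_snapshots, n_entities).
    """
    return np.flatnonzero(coefficient_of_variation(scores) > threshold)


def volatile_features(
    features: np.ndarray, unstable_idx: np.ndarray, threshold: float = 0.2
) -> np.ndarray:
    """Indices of features that fluctuate on the unstable entities.

    features has shape (n_snapshots, n_entities, n_features).
    Assumed heuristic: average each feature's CoV over the unstable entities.
    """
    if len(unstable_idx) == 0:
        return np.array([], dtype=int)
    per_entity_cov = coefficient_of_variation(features[:, unstable_idx, :])
    return np.flatnonzero(per_entity_cov.mean(axis=0) > threshold)


# Toy data: 3 snapshots, 2 entities, 2 features
# (feature 0 plays the role of a semantic feature, feature 1 an engagement feature).
scores = np.array([[0.9, 0.2], [0.9, 0.8], [0.9, 0.5]])
features = np.array(
    [
        [[1.0, 5.0], [1.0, 10.0]],
        [[1.0, 5.0], [1.0, 100.0]],
        [[1.0, 5.0], [1.0, 50.0]],
    ]
)

unstable = unstable_entities(scores)              # entity 1 is flagged
to_prune = volatile_features(features, unstable)  # feature 1 is flagged
stable_features = np.delete(features, to_prune, axis=2)
```

Step (4) would then retrain the model on `stable_features` only; that step depends on the model and training pipeline and is not shown here.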