On the Necessity of World Knowledge for Mitigating Missing Labels in Extreme Classification

Extreme Classification (XC) aims to map a query to the most relevant documents from a very large document set. XC algorithms used in real-world applications learn this mapping from datasets curated from implicit feedback, such as user clicks. However, these datasets inevitably suffer from missing labels. In this work, we observe that systematic missing labels lead to missing knowledge, which is critical for accurately modelling relevance between queries and documents. We formally show that this absence of knowledge cannot be recovered using existing methods, such as propensity weighting and data imputation strategies, that rely solely on the training dataset. While large language models (LLMs) provide an attractive way to supply the missing knowledge, leveraging them in applications with low latency requirements and large document sets is challenging. To incorporate missing knowledge at scale, we propose SKIM (Scalable Knowledge Infusion for Missing Labels), an algorithm that combines a small language model (SLM) with abundant unstructured metadata to effectively mitigate the missing label problem. We show the efficacy of our method on large-scale public datasets through exhaustive unbiased evaluation, ranging from human annotations to simulations inspired by industrial settings. SKIM outperforms existing methods on Recall@100 by more than 10 absolute points. Additionally, SKIM scales to proprietary query-ad retrieval datasets containing 10 million documents, outperforming contemporary methods by 12% in offline evaluation and increasing ad click-yield by 1.23% in an online A/B test conducted on a popular search engine. We release our code, prompts, trained XC models, and finetuned SLMs at: https://github.com/bicycleman15/skim
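For readers unfamiliar with the headline metric, the following is a minimal sketch of how Recall@k is commonly computed for a retrieval model. It is not taken from the released code: the function names `recall_at_k` and `mean_recall_at_k` are hypothetical, and the normalization by the number of relevant documents is an assumption (some evaluations normalize by min(k, number of relevant documents) instead).

```python
from typing import Sequence, Set


def recall_at_k(ranked_docs: Sequence[int], relevant_docs: Set[int], k: int = 100) -> float:
    """Fraction of the relevant documents that appear among the top-k ranked documents."""
    if not relevant_docs:
        # Assumption: a query with no relevant documents contributes 0.0.
        return 0.0
    top_k = set(ranked_docs[:k])
    return len(top_k & relevant_docs) / len(relevant_docs)


def mean_recall_at_k(
    rankings: Sequence[Sequence[int]],
    relevants: Sequence[Set[int]],
    k: int = 100,
) -> float:
    """Average Recall@k over queries; rankings[i] and relevants[i] belong to query i."""
    scores = [recall_at_k(r, rel, k) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores) if scores else 0.0
```

The value of such a metric depends on which relevance set is passed in. Scoring against click-derived labels inherits their missing labels, which is presumably why the evaluation described above relies on human annotations and simulations rather than on the training labels alone.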
