Exploring the Viability of Synthetic Query Generation for Relevance Prediction

Query-document relevance prediction is a critical problem in Information Retrieval systems. It is increasingly tackled with pretrained transformer-based models that are finetuned on large collections of labeled data. However, in specialized domains such as e-commerce and healthcare, the viability of this approach is limited by the scarcity of large in-domain datasets. To address this scarcity, recent methods leverage these powerful models to generate high-quality task- and domain-specific synthetic data. Prior work has largely explored synthetic data generation, or query generation (QGen), for Question-Answering (QA) and binary (yes/no) relevance prediction, where, for instance, a QGen model is given a document and trained to generate a query relevant to that document. However, in many problems, the notion of relevance is more fine-grained than a simple yes/no label. Thus, in this work, we conduct a detailed study of how QGen approaches can be leveraged for nuanced relevance prediction. We demonstrate that, contrary to claims from prior works, current QGen approaches fall short of the more conventional cross-domain transfer-learning approaches. Via empirical studies spanning three public e-commerce benchmarks, we identify new shortcomings of existing QGen approaches, including their inability to distinguish between different grades of relevance. To address this, we introduce label-conditioned QGen models, which incorporate knowledge about the different relevance grades. While our experiments demonstrate that these modifications help improve the performance of QGen techniques, we also find that QGen approaches struggle to capture the full nuance of the relevance label space, and as a result the generated queries are not faithful to the desired relevance label.
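As an illustration of the difference between standard and label-conditioned QGen, the following minimal sketch shows one way training pairs for a sequence-to-sequence query generator could be formatted. It is an assumption-laden example, not the models studied in this work: the grade names in `GRADES`, the `Example` type, the `to_seq2seq_pair` helper, and the `relevance: ... | document: ...` input template are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical relevance grades; the abstract does not list the actual label set.
GRADES = ("irrelevant", "partial", "exact")


@dataclass(frozen=True)
class Example:
    document: str
    query: str
    grade: str


def to_seq2seq_pair(ex: Example, label_conditioned: bool = True) -> tuple[str, str]:
    """Return (model_input, target_query) for training a query generator.

    With label_conditioned=False the input is the document alone, so the
    generator can only learn to produce "relevant" queries. With
    label_conditioned=True the desired grade is prepended to the input
    (an assumed template), so one model can be asked for queries of
    different relevance grades.
    """
    if ex.grade not in GRADES:
        raise ValueError(f"unknown grade: {ex.grade!r}")
    prefix = f"relevance: {ex.grade} | " if label_conditioned else ""
    return f"{prefix}document: {ex.document}", ex.query


example = Example("USB-C charging cable, 2 m", "usb c cable", "exact")
conditioned = to_seq2seq_pair(example)
unconditioned = to_seq2seq_pair(example, label_conditioned=False)
```

Under these assumptions, synthetic data for a target grade would be produced at generation time by feeding the trained model an input with that grade in the prefix; whether the generated query is actually faithful to the requested grade is the open issue the abstract highlights.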
