Exploring the Viability of Synthetic Query Generation for Relevance Prediction

Query-document relevance prediction is a critical problem in Information Retrieval systems. It is increasingly tackled with (pretrained) transformer-based models that are finetuned on large collections of labeled data. However, in specialized domains such as e-commerce and healthcare, the viability of this approach is limited by the dearth of large in-domain datasets. To address this paucity, recent methods leverage these powerful models to generate high-quality task- and domain-specific synthetic data. Prior work has largely explored synthetic data generation, or query generation (QGen), for Question-Answering (QA) and binary (yes/no) relevance prediction, where, for instance, a QGen model is given a document and trained to generate a query relevant to that document. However, many problems involve a more fine-grained notion of relevance than a simple yes/no label. Thus, in this work, we conduct a detailed study of how QGen approaches can be leveraged for nuanced relevance prediction. We demonstrate that, contrary to claims from prior works, current QGen approaches fall short of the more conventional cross-domain transfer-learning approaches. Via empirical studies spanning 3 public e-commerce benchmarks, we identify new shortcomings of existing QGen approaches, including their inability to distinguish between different grades of relevance. To address this, we introduce label-conditioned QGen models, which incorporate knowledge about the different relevance grades. While our experiments demonstrate that these modifications help improve the performance of QGen techniques, we also find that QGen approaches struggle to capture the full nuance of the relevance label space, and as a result the generated queries are not faithful to the desired relevance label.
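
As a rough illustration of the difference between plain and label-conditioned QGen, the sketch below builds (source, target) text pairs of the kind a sequence-to-sequence generator could be finetuned on. It is only a minimal sketch under stated assumptions: the `LabeledPair` type, the `to_qgen_example` helper, the prompt template, and the example label names are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledPair:
    """One annotated (query, document, relevance label) triple (hypothetical type)."""
    query: str
    document: str
    label: str  # e.g. "exact" or "irrelevant"; label names are assumptions


def to_qgen_example(pair: LabeledPair, label_conditioned: bool = True) -> tuple[str, str]:
    """Turn a labeled pair into a (source, target) text pair for a seq2seq generator.

    Plain QGen sees only the document; the label-conditioned variant also
    sees the desired relevance grade, so it can learn to produce queries
    at different grades for the same document.
    """
    if label_conditioned:
        source = f"relevance: {pair.label} | document: {pair.document}"
    else:
        source = f"document: {pair.document}"
    return source, pair.query


pair = LabeledPair(
    query="wireless earbuds",
    document="Bluetooth in-ear headphones with charging case",
    label="exact",
)
plain_source, plain_target = to_qgen_example(pair, label_conditioned=False)
cond_source, cond_target = to_qgen_example(pair, label_conditioned=True)
```

At generation time, the same template would be filled with an unlabeled in-domain document and each desired label in turn, yielding synthetic queries intended to match each relevance grade.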
