Enhancing Healthcare Search Intent Recognition with Query Representation Learning and Session Context

Classifying the intent behind healthcare search queries is crucial for improving the delivery of online healthcare information. The intricate nature of medical search queries, coupled with the limited availability of high-quality labeled data, presents substantial challenges for developing efficient classification models. Previous studies have exploited user interaction data, such as clicks from search logs, and employed pairwise loss functions to model co-click behavior for query representation learning. However, many health queries have multiple intents, resulting in ambiguous or divergent click behavior. Furthermore, learning only the single most popular intent of a query, as inferred from global statistics aggregated over the behavior of different users, can lead to disparity and a drop in performance when classifying query intent within specific search sessions. To address these limitations, our work improves query representation learning by aggregating similar queries via clustering and by introducing a novel loss function designed to capture the multifaceted nature of health search queries, resulting in a more scalable and accurate learning procedure. In addition, we introduce the concordance rate (CR) score to quantify both the ambiguity of health queries and the misalignment between global search intents and those discerned from individual sessions, and we demonstrate a simple and effective method for incorporating our learned query representation into contextual, session-based search intent classification. Extensive experiments and analysis on two real-world search log datasets, a Health Search (HS) dataset and the publicly available TripClick dataset, demonstrate that our approach not only improves intrinsic clustering metrics for query representation learning but also enhances accuracy on subsequent search intent classification tasks.
