Session-level Normalization and Click-through Data Enhancement for Session-based Evaluation

Because resolving a complex information need usually requires a user to issue a sequence of queries and examine multiple documents within a search session, researchers have paid much attention to evaluating search systems at the session level rather than the single-query level. Most existing session-level metrics evaluate each query separately and then aggregate the query-level scores with a session-level weighting function. These metrics assume that every query in the session must be included and that the order of the queries is fixed. However, if a search system satisfies the user within her first few queries, she may not need to issue any subsequent queries. Moreover, in most real-world search scenarios explicit feedback from real users is unavailable, so offline evaluation can only rely on implicit feedback, such as users' clicks, as relevance labels. Such implicit feedback may differ from the true relevance within a search session, because a document may be overlooked under an earlier query but recognized as relevant in a later reformulation. To address these issues, we make two assumptions about session-based evaluation, which explicitly describe an ideal session-search system and how to enhance click-through data when computing session-level evaluation metrics. Based on these assumptions, we design a session-level metric called Normalized U-Measure (NUM). NUM evaluates a session as a whole and uses an ideal session to normalize the result of the actual session. In addition, it infers session-level relevance labels from implicit feedback. Experiments on two public datasets demonstrate the effectiveness of NUM by comparing it with existing session-based metrics in terms of correlation with user satisfaction and intuitiveness. We also conduct ablation studies to examine whether the two assumptions hold.
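To make the two ideas above concrete, the following is a minimal sketch of session-level normalization and click-through enhancement. It is not the definition of NUM, which this abstract does not give. Everything in it is an assumption made for illustration: the function names, the binary gain, the rank-based linear decay max(0, 1 - pos/horizon) in the style of U-measure, the rule that a document clicked anywhere in the session is credited at its first appearance, and the ideal session that shows every relevant document once at the very top.

```python
from typing import List, Tuple

# One query's results in rank order: (doc_id, was_clicked).
# All names here are hypothetical and chosen for illustration only.
Query = List[Tuple[str, bool]]


def session_labels(session: List[Query], enhance: bool = True) -> List[int]:
    """Flatten a session into one list of binary gains, one per result shown.

    With enhance=True (assumed enhancement rule), a document clicked under
    any query of the session is treated as relevant wherever it appears.
    With enhance=False, only the raw click on that result counts.
    Each document is credited at most once.
    """
    clicked = {doc for query in session for doc, c in query if c}
    credited = set()
    labels = []
    for query in session:
        for doc, was_clicked in query:
            relevant = (doc in clicked) if enhance else was_clicked
            if relevant and doc not in credited:
                labels.append(1)
                credited.add(doc)
            else:
                labels.append(0)
    return labels


def decayed_gain(labels: List[int], horizon: int) -> float:
    """Sum of gains with a linear position decay max(0, 1 - pos / horizon)."""
    return sum(g * max(0.0, 1.0 - pos / horizon) for pos, g in enumerate(labels))


def normalized_session_score(
    session: List[Query], horizon: int = 20, enhance: bool = True
) -> float:
    """Score of the actual session divided by the score of an ideal session.

    The ideal session (an assumption of this sketch) presents every relevant
    document once, at the top, so the user needs no further queries.
    """
    labels = session_labels(session, enhance)
    actual = decayed_gain(labels, horizon)
    ideal = decayed_gain([1] * sum(labels), horizon)
    return actual / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # "a" is skipped under the first query but clicked after a reformulation.
    session = [
        [("a", False), ("b", True)],
        [("a", True), ("c", False)],
    ]
    print(normalized_session_score(session, horizon=4, enhance=True))
    print(normalized_session_score(session, horizon=4, enhance=False))
```

In this toy session, document "a" is skipped under the first query and clicked only after the reformulation. With the assumed enhancement rule it is credited at its earlier position, so the enhanced score is higher than the score computed from raw clicks. The `horizon` parameter must be positive.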
