Personalization in Information Retrieval has been studied for a long time. Nevertheless, high-quality, real-world datasets for conducting large-scale experiments and evaluating personalized search models are still lacking. This paper helps fill this gap by introducing SE-PQA (StackExchange - Personalized Question Answering), a new curated resource for designing and evaluating personalized models for the task of community Question Answering (cQA). The dataset includes more than 1 million queries and 2 million answers, annotated with a rich set of features modeling the social interactions among the users of a popular cQA platform. We describe the characteristics of SE-PQA and detail the features associated with questions and answers. We also provide reproducible baseline methods for the cQA task based on the resource, including deep learning models and personalization approaches. Preliminary experiments show that SE-PQA is suitable for training effective cQA models; they also show that personalization remarkably improves the effectiveness of all the methods tested. Furthermore, we show that combining data from multiple communities for personalization purposes benefits robustness and generalization.