In large organisations, identifying experts on a given topic is crucial for leveraging the internal knowledge spread across teams and departments. So-called enterprise expert retrieval systems automatically discover and structure employees' expertise based on the vast amounts of heterogeneous data available about them and the work they perform. Evaluating these systems requires comprehensive ground-truth expert annotations, which are hard to obtain. Therefore, the annotation process typically relies on automated recommendations of knowledge areas to validate. This case study analyses how such recommendations can affect the evaluation of expert finding systems. We demonstrate on a popular benchmark that system-validated annotations lead to overestimated performance of traditional term-based retrieval models and even invalidate comparisons with more recent neural methods. We also augment knowledge areas with synonyms to uncover a strong bias towards literal mentions of their constituent words. Finally, we propose constraints on the annotation process to prevent these biased evaluations, and show that the resulting annotation suggestions retain high utility. These findings should inform the creation and selection of expert finding benchmarks, so as to guarantee meaningful comparisons between methods.