Large archival collections, such as email or government documents, must be manually reviewed to identify any sensitive information before they can be released publicly. Sensitivity classification has received considerable attention in the literature. More recently, however, there has been increasing interest in developing sensitivity-aware search engines that provide users with relevant search results while ensuring that no sensitive documents are returned. Sensitivity-aware search would reduce the need for a manual sensitivity review before collections are made publicly available. Developing such systems requires test collections that contain relevance assessments for a set of information needs as well as ground-truth labels for a variety of sensitivity categories. The well-known Enron email collection contains classification ground truth that can be used to represent sensitive information; e.g., the Purely Personal and Personal but in Professional Context categories can be used to represent sensitive personal information. However, the existing Enron collection does not contain a set of information needs and relevance assessments. In this work, we present a collection of fifty information needs (topics) with crowdsourced query formulations (3 per topic) and relevance assessments (11,471 in total) for the Enron collection (mean number of relevant documents per topic = 11, variance = 34.7). The information needs, queries and relevance judgements are available on GitHub and will be made available alongside the existing Enron collection through the popular ir_datasets library. Our collection is the first freely available test collection for developing sensitivity-aware search systems.
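To make the evaluation setting concrete, the following is a minimal sketch of how relevance assessments and sensitivity labels could be combined when scoring a ranked list. It is an illustration under stated assumptions, not the method or data format of this work: the function names `filter_sensitive` and `precision_at_k`, the document identifiers, and the toy labels are all hypothetical, and a real system might handle sensitive documents in other ways than simple post-hoc filtering.

```python
from typing import List, Set


def filter_sensitive(ranking: List[str], sensitive: Set[str]) -> List[str]:
    """Drop every document labelled sensitive, preserving rank order."""
    return [doc_id for doc_id in ranking if doc_id not in sensitive]


def precision_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k positions holding a document judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(doc_id in relevant for doc_id in ranking[:k]) / k


# Hypothetical toy data standing in for one topic of a test collection.
ranking = ["d1", "d2", "d3", "d4", "d5"]   # system output, best first
relevant = {"d1", "d3", "d5"}              # relevance assessments
sensitive = {"d2", "d3"}                   # sensitivity ground truth

# Remove sensitive documents before anything is shown to the user.
safe_ranking = filter_sensitive(ranking, sensitive)

# Count how many sensitive documents would still be returned.
leaked = len(set(safe_ranking) & sensitive)

print(safe_ranking, leaked, precision_at_k(safe_ranking, relevant, 3))
```

In this toy case the relevant document `d3` is also sensitive, so it is withheld; a test collection with both kinds of labels is what allows such trade-offs between relevance and sensitivity to be measured.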