A Highly Accurate Query-Recovery Attack against Searchable Encryption using Non-Indexed Documents

Cloud data storage solutions offer customers cost-effective storage and reduced data-management effort. Attractive as this is, data security remains a core concern. Traditional encryption protects stored documents but hinders simple functionality such as keyword search. Searchable encryption schemes have therefore been proposed to allow search over encrypted data. Efficient schemes leak at least the access pattern (the documents accessed per keyword search), which is known to be exploitable in query-recovery attacks when the attacker has a significant amount of background knowledge about the stored documents. Existing attacks achieve decent results only under strong adversary models (e.g., knowledge of at least 20% of the stored documents, or additional knowledge such as query frequencies), and they provide no metric for evaluating the certainty of recovered queries. This hampers their practical utility and calls their real-world relevance into question. We propose a refined score attack that achieves query-recovery rates of around 85% without requiring exact background knowledge of the stored documents; a distributionally similar but otherwise different (i.e., non-indexed) dataset suffices. The attack starts with very few known queries (around 10 in our experiments over datasets of varying size) and then iteratively recovers further queries together with confidence scores, adding previously recovered queries with high confidence scores to the set of known queries. In addition to high recovery rates, our approach yields interpretable results in the form of confidence scores.
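
The iterative loop described above can be illustrated with a minimal sketch. The abstract does not specify the scoring function, so the sketch rests on stated assumptions: each candidate (query, keyword) pair is scored by the negative log distance between co-occurrence profiles restricted to the known queries, and confidence is the gap between the best and second-best score. All names (`score`, `iterative_recovery`, `td_cooc`, `kw_cooc`, `batch_size`) and the toy data are hypothetical and are not taken from the paper.

```python
import math

def score(td, kw, known, td_cooc, kw_cooc):
    # Assumed score: negative log of the Euclidean distance between the
    # query's and the keyword's co-occurrence values over the known pairs.
    dist = math.sqrt(sum((td_cooc[td][t] - kw_cooc[kw][k]) ** 2
                         for t, k in known.items()))
    return -math.log(dist + 1e-12)  # small constant avoids log(0)

def iterative_recovery(td_cooc, kw_cooc, known, batch_size=1):
    known = dict(known)      # query token -> keyword, grows every round
    confidence = {}          # query token -> confidence of its prediction
    unknown = [t for t in td_cooc if t not in known]
    while unknown:
        used = set(known.values())
        # Assumption: each keyword is matched to at most one query.
        candidates = [k for k in kw_cooc if k not in used]
        if not candidates:
            break
        predictions = []
        for td in unknown:
            ranked = sorted(((score(td, k, known, td_cooc, kw_cooc), k)
                             for k in candidates), reverse=True)
            best_score, best_kw = ranked[0]
            # Confidence: gap between best and second-best candidate.
            gap = best_score - ranked[1][0] if len(ranked) > 1 else math.inf
            predictions.append((gap, td, best_kw))
        # Promote only the most confident predictions to "known".
        predictions.sort(reverse=True)
        for gap, td, kw in predictions[:batch_size]:
            if kw in used:
                continue
            known[td] = kw
            confidence[td] = gap
            used.add(kw)
            unknown.remove(td)
    return known, confidence

# Toy data: keyword co-occurrence from a similar (non-indexed) dataset and
# query co-occurrence as it would be derived from the leaked access pattern.
kw_cooc = {
    "a": {"a": 0.60, "b": 0.50, "c": 0.20, "d": 0.10},
    "b": {"a": 0.50, "b": 0.70, "c": 0.30, "d": 0.05},
    "c": {"a": 0.20, "b": 0.30, "c": 0.40, "d": 0.15},
    "d": {"a": 0.10, "b": 0.05, "c": 0.15, "d": 0.30},
}
td_cooc = {
    "t1": {"t1": 0.58, "t2": 0.48, "t3": 0.22, "t4": 0.11},
    "t2": {"t1": 0.48, "t2": 0.72, "t3": 0.31, "t4": 0.06},
    "t3": {"t1": 0.22, "t2": 0.31, "t3": 0.41, "t4": 0.14},
    "t4": {"t1": 0.11, "t2": 0.06, "t3": 0.14, "t4": 0.29},
}
recovered, confidence = iterative_recovery(td_cooc, kw_cooc, {"t1": "a"})
```

On this toy input, a single known query is enough for the loop to match the remaining three queries to their keywords. The `batch_size` parameter controls how many predictions are promoted per round: a smaller value is more conservative, at the cost of more rounds.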
