This study investigates the optimal selection of parameters for collaborative clustering while preserving data privacy. We focus on key clustering algorithms within a collaborative framework in which multiple data owners combine their data and a semi-trusted server recommends the most suitable clustering algorithm and its parameters. Our findings indicate that the privacy parameter ($\epsilon$) has minimal impact on the server's recommendations, but that a larger $\epsilon$ (a weaker privacy guarantee) raises the risk of membership inference attacks, through which sensitive information might be inferred. To mitigate these risks, we apply differential privacy techniques, particularly the Randomized Response mechanism, which adds noise to protect data privacy. Our approach demonstrates that high-quality clustering can be achieved while maintaining data confidentiality, as evidenced by metrics such as the Adjusted Rand Index and Silhouette Score. This study contributes to privacy-aware data sharing, optimal algorithm and parameter selection, and effective communication between data owners and the server.
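As a rough illustration of how Randomized Response trades accuracy for privacy through $\epsilon$, the sketch below implements the textbook binary variant: each bit is reported truthfully with probability $e^{\epsilon}/(1+e^{\epsilon})$ and flipped otherwise. The abstract does not specify how the mechanism is applied to the clustering data, so the binary encoding, the function names (`keep_probability`, `randomized_response`, `estimate_mean`), and the sample data are all assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def keep_probability(epsilon: float) -> float:
    """Probability of reporting the true bit under binary randomized response."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomized_response(bits, epsilon, rng):
    """Report each bit truthfully with probability p, otherwise flip it."""
    p = keep_probability(epsilon)
    return [b if rng.random() < p else 1 - b for b in bits]


def estimate_mean(noisy_bits, epsilon):
    """Debiased estimate of the true fraction of ones (requires epsilon > 0)."""
    p = keep_probability(epsilon)
    observed = sum(noisy_bits) / len(noisy_bits)
    # E[observed] = (1 - p) + mean * (2p - 1), solved here for mean.
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)


# Hypothetical data: 30% of 10,000 records carry the sensitive bit.
rng = random.Random(0)
true_bits = [1] * 3000 + [0] * 7000
noisy = randomized_response(true_bits, epsilon=1.0, rng=rng)
estimate = estimate_mean(noisy, epsilon=1.0)
print(f"keep probability: {keep_probability(1.0):.3f}")
print(f"estimated fraction of ones: {estimate:.3f}")
```

A smaller $\epsilon$ pushes the keep probability toward 0.5, so individual reports reveal less, while a larger $\epsilon$ leaves the reports closer to the true data. This matches the abstract's observation that increasing $\epsilon$ raises the risk of membership inference.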