Although the NLP community has adopted central differential privacy as a go-to framework for privacy-preserving model training and data sharing, the choice and interpretation of its key parameter, the privacy budget $\varepsilon$ that governs the strength of privacy protection, remain largely arbitrary. We argue that determining the $\varepsilon$ value should not rest solely with researchers or system developers, but must also take into account the actual people who share their potentially sensitive data. In other words: Would you share your instant messages for an $\varepsilon$ of 10? We address this research gap by designing, implementing, and conducting a behavioral experiment (311 lay participants) to study how people make decisions under uncertainty when facing privacy threats. Framing risk perception in terms of two realistic NLP scenarios and using a vignette-based behavioral study, we determine the $\varepsilon$ thresholds at which lay people would be willing to share sensitive textual data; to our knowledge, this is the first study of its kind.
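To give a concrete sense of what a privacy budget like $\varepsilon = 10$ means, the following sketch uses binary randomized response, a textbook differentially private mechanism that is not part of this study's experimental design. Under randomized response, the probability that a participant's true answer is reported is $e^{\varepsilon}/(e^{\varepsilon}+1)$, so larger $\varepsilon$ means weaker protection:

```python
import math

def truth_probability(epsilon: float) -> float:
    """For binary randomized response, the probability that the mechanism
    reports the respondent's true answer is e^eps / (e^eps + 1).
    At eps = 0 this is 0.5 (a coin flip, perfect deniability); as eps
    grows, the reported answer is almost always the true one."""
    return math.exp(epsilon) / (math.exp(epsilon) + 1.0)

# Illustrative budgets: a strict, a moderate, and the eps = 10 from the title.
for eps in (0.1, 1.0, 10.0):
    print(f"epsilon = {eps:>4}: true answer revealed with p = {truth_probability(eps):.5f}")
```

At $\varepsilon = 10$ the true answer is reported with probability above 0.9999, i.e. essentially no plausible deniability remains, which is why the interpretation of such budgets by the people actually sharing data matters.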