How could a data breach involving government-issued IDs, such as passports and driver's licenses, be comparable to a voluntary disclosure on a nondescript social-media platform? At first glance the breach appears more serious, and that is a valid assessment: the leaked data could contain an individual's date of birth and address, and exposing it would be disastrous. Against that threat, an innocuous online post seems comparatively harmless. But is it? From that post and others like it, a forensic linguist could use stylometry to uncover similar kinds of information, estimating an age range for the author (adolescent or adult) and narrowing down their geographical location (to a specific country). Stylometry is not an exact science, since its determinations are statistical, but it can reveal comparable, though noticeably diluted, information about an individual. Preventing an ID breach is conceptually simple: share the ID as little as possible. Preventing personal information from leaking through written text requires a more complex solution: adversarial stylometry. In this paper, we explore how homoglyph substitution, the replacement of characters with visually similar alternatives (e.g., "h" $\texttt{[U+0068]}$ $\rightarrow$ "һ" $\texttt{[U+04BB]}$), can degrade stylometric systems.
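To make the idea of homoglyph substitution concrete, the following is a minimal Python sketch. It is not the paper's implementation: the mapping `HOMOGLYPHS`, the function name `substitute_homoglyphs`, and the `rate` and `seed` parameters are all assumptions introduced here for illustration, and the map covers only a handful of Latin letters with Cyrillic look-alikes.

```python
import random

# Hypothetical, hand-picked map from Latin letters to visually similar
# Cyrillic code points. A real system would likely use a larger table.
HOMOGLYPHS = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "c": "\u0441",  # CYRILLIC SMALL LETTER ES
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "h": "\u04bb",  # CYRILLIC SMALL LETTER SHHA
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "p": "\u0440",  # CYRILLIC SMALL LETTER ER
}


def substitute_homoglyphs(text, rate=1.0, seed=None):
    """Replace mapped characters with look-alikes.

    Each mapped character is replaced with probability `rate`;
    unmapped characters are passed through unchanged.
    """
    rng = random.Random(seed)
    return "".join(
        HOMOGLYPHS[ch] if ch in HOMOGLYPHS and rng.random() < rate else ch
        for ch in text
    )


original = "the phone"
obfuscated = substitute_homoglyphs(original)

# The two strings look alike on screen but differ at the code-point level.
print(obfuscated)
print([f"U+{ord(ch):04X}" for ch in obfuscated])
```

The obfuscated string has the same length and a near-identical appearance, but any feature extractor that operates on code points (for example, character n-grams or a word vocabulary lookup) sees different input. Whether and how much this degrades a given stylometric system is an empirical question that the sketch does not answer.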