When using a public communication channel--whether formal or informal, such as commenting or posting on social media--end users have no expectation of privacy: they compose a message and broadcast it for the world to see. Even if an end user takes the utmost precautions to anonymize their online presence--using an alias or pseudonym; masking their IP address; spoofing their geolocation; concealing their operating system and user agent; deploying encryption; registering with a disposable phone number or email address; disabling non-essential settings; revoking permissions; and blocking cookies and fingerprinting--one obvious element still lingers: the message itself. Assuming they avoid lapses in judgment or accidental self-exposure, there should be little evidence left to establish their actual identity, right? Wrong. The content of their message--necessarily open for public consumption--exposes an attack vector: stylometric analysis, or author profiling. In this paper, we dissect the technique of stylometry, discuss its counter-strategy, adversarial stylometry, and devise enhancements through Unicode steganography.