An Evaluation of Chat Safety Moderations in Roblox

Roblox is among the most popular online gaming platforms, used by hundreds of millions of users every day. A substantial portion of these users are minors, who face heightened risk because abusive users may use Roblox's real-time chat interface to make initial contact with potential victims. Roblox employs automated chat moderation mechanisms to detect potentially abusive messages; however, to date, their effectiveness has not been independently investigated. To address this gap, we collected approximately 2 million chat messages from four games across multiple age groups and analyzed them to evaluate the moderation system. These messages were collected from public game servers in accordance with ethical and legal norms as well as Roblox's terms of service. We use this corpus to qualitatively study which types of unsafe chats escape the moderation system and how policy-violating users evade it. Given the dataset's scale, manual qualitative content analysis of the full corpus is prohibitively expensive. We therefore adopt a two-step approach. First, we manually labeled messages as safe or unsafe (n=99.8K) and used these labels as ground truth to evaluate four locally hosted state-of-the-art large language models (LLMs). Next, we applied the best-performing LLM to the entire corpus to identify potentially unsafe messages, which we then manually categorized using iterative open and axial coding until thematic saturation was reached. Overall, our findings reveal a troubling reality: numerous unsafe chat messages related to grooming, sexualization of minors, bullying and harassment, violence, self-harm, and sharing of sensitive information, among other categories, escaped the current moderation system. Our analysis of users whose messages had previously been flagged revealed that they continue to send harmful messages by employing a wide range of techniques to evade the moderation system.
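The first step of the two-step approach, comparing candidate classifiers against manually labeled ground truth, can be illustrated with a minimal sketch. The abstract does not state which metric defined "best-performing", so the use of F1 on the unsafe class is an assumption here, and the function names, model names, and toy labels are all hypothetical.

```python
from typing import Dict, List, Tuple


def binary_metrics(gold: List[int], pred: List[int]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for the unsafe class (label 1)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


def pick_best_model(gold: List[int], predictions: Dict[str, List[int]]) -> str:
    """Select the model name with the highest F1 (assumed selection criterion)."""
    return max(predictions, key=lambda name: binary_metrics(gold, predictions[name])[2])


# Toy data only: 1 = unsafe, 0 = safe. Not drawn from the study's corpus.
toy_gold = [1, 1, 0, 0, 1]
toy_predictions = {
    "model_a": [1, 0, 0, 0, 1],  # misses one unsafe message
    "model_b": [1, 1, 1, 1, 1],  # flags everything as unsafe
}
best_model = pick_best_model(toy_gold, toy_predictions)
```

In a setting like this one, recall on the unsafe class may matter more than precision, since messages flagged by the selected model are reviewed manually afterward; a different selection metric could be substituted in `pick_best_model` without changing the rest of the sketch.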
