User-configured chatbots built on top of large language models are increasingly available through centralized marketplaces such as OpenAI's GPT Store. While these platforms enforce usage policies intended to prevent harmful or inappropriate behavior, the scale and opacity of customized chatbots make systematic policy enforcement challenging. As a result, policy-violating chatbots remain publicly accessible despite existing review processes. This paper presents a fully automated method for evaluating the compliance of Custom GPTs with the marketplace's usage policies through black-box interaction. The method combines large-scale GPT discovery, policy-driven red-teaming prompts, and automated compliance assessment using an LLM-as-a-judge. We focus on three policy-relevant domains explicitly addressed in OpenAI's usage policies: Romantic, Cybersecurity, and Academic GPTs. We validate our compliance assessment component against a human-annotated ground-truth dataset, achieving an F1 score of 0.975 for binary policy violation detection. We then apply the method in a large-scale empirical study of 782 Custom GPTs retrieved from the GPT Store. The results show that 58.7% of the evaluated GPTs exhibit at least one policy-violating response, with substantial variation across policy domains. A comparison with the base models (GPT-4 and GPT-4o) indicates that most violations originate from model-level behavior, while customization tends to amplify these tendencies rather than create new failure modes. Our findings reveal limitations in current review mechanisms for user-configured chatbots and demonstrate the feasibility of scalable, behavior-based policy compliance evaluation.