Robots must balance compliance with safety and social expectations: blind obedience can cause harm, while over-refusal erodes trust. Existing safe reinforcement learning (RL) benchmarks emphasize physical hazards, while human-robot interaction trust studies are small-scale and hard to reproduce. We present the Empathic Ethical Disobedience (EED) Gym, a standardized testbed that jointly evaluates refusal safety and social acceptability. Agents weigh risk, affect, and trust when choosing to comply, refuse (with or without explanation), clarify, or propose safer alternatives. EED Gym provides diverse scenarios, multiple persona profiles, and metrics for safety, calibration, and refusal behavior, with trust and blame models grounded in a vignette study. Using EED Gym, we find that action masking eliminates unsafe compliance and that explanatory refusals help sustain trust. Constructive refusal styles are rated most trustworthy and empathic styles most empathic, while safe RL methods improve robustness but also make agents more prone to overly cautious behavior. We release code, configurations, and reference policies to enable reproducible evaluation and systematic human-robot interaction research on refusal and trust. At submission time, we include an anonymized reproducibility package with code and configs, and we commit to open-sourcing the full repository upon acceptance.