Recent breakthroughs in natural language processing (NLP) have enabled the open-ended synthesis and comprehension of coherent text, translating theoretical algorithms into practical applications. Large language models (LLMs) have significantly affected businesses such as report summarization software and copywriting. Observations indicate, however, that LLMs may exhibit social prejudice and toxicity, posing ethical and societal dangers when they are deployed irresponsibly. Large-scale benchmarks for accountable LLMs should consequently be developed. Although several empirical investigations reveal ethical difficulties in advanced LLMs, there has been no systematic examination or user study of the ethics of current LLM use. To inform future efforts to build ethical LLMs responsibly, we conduct a qualitative study of OpenAI's ChatGPT to better understand the practical features of ethical dangers in recent LLMs. We analyze ChatGPT comprehensively from four perspectives: 1) \textit{Bias}, 2) \textit{Reliability}, 3) \textit{Robustness}, and 4) \textit{Toxicity}. In line with these perspectives, we empirically benchmark ChatGPT on multiple sample datasets. We find that a significant number of ethical risks cannot be addressed by existing benchmarks, and we therefore illustrate them with additional case studies. In addition, we examine the implications of our findings for the AI ethics of ChatGPT, as well as future problems and practical design considerations for LLMs. We believe that our findings may shed light on future efforts to identify and mitigate the ethical hazards posed by machines in LLM applications.