As Large Language Models (LLMs) continue to revolutionize Natural Language Processing (NLP) applications, critical concerns about their trustworthiness persist, particularly regarding safety and robustness. To address these challenges, we introduce TRUSTVIS, an automated evaluation framework that provides a comprehensive assessment of LLM trustworthiness. A key feature of our framework is its interactive user interface, designed to offer intuitive visualizations of trustworthiness metrics. By integrating well-known perturbation methods such as AutoDAN and employing majority voting across multiple evaluation methods, TRUSTVIS not only produces reliable results but also makes complex evaluation processes accessible to users. Preliminary case studies on models such as Vicuna-7b, Llama2-7b, and GPT-3.5 demonstrate the effectiveness of our framework in identifying safety and robustness vulnerabilities, while the interactive interface allows users to explore results in detail, empowering targeted model improvements. Video Link: https://youtu.be/k1TrBqNVg8g
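To illustrate the aggregation idea mentioned above, the following is a minimal sketch (not the authors' implementation) of majority voting across several safety evaluators; the evaluator functions and verdict labels here are purely hypothetical placeholders.

```python
# Minimal sketch of majority voting over multiple evaluation methods.
# All evaluators below are hypothetical examples, not part of TRUSTVIS.
from collections import Counter
from typing import Callable, List

# Each evaluator maps a model response to a verdict such as "safe" or "unsafe".
Evaluator = Callable[[str], str]

def majority_vote(response: str, evaluators: List[Evaluator]) -> str:
    """Return the verdict produced by the largest number of evaluators."""
    verdicts = [evaluate(response) for evaluate in evaluators]
    return Counter(verdicts).most_common(1)[0][0]

# Hypothetical evaluators for illustration only.
keyword_eval = lambda r: "unsafe" if "how to build" in r.lower() else "safe"
length_eval  = lambda r: "safe" if len(r) < 2000 else "unsafe"
refusal_eval = lambda r: "safe" if r.strip().startswith("I can't") else "unsafe"

if __name__ == "__main__":
    verdict = majority_vote("I can't help with that request.",
                            [keyword_eval, length_eval, refusal_eval])
    print(verdict)  # prints "safe": all three hypothetical evaluators agree
```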