With recent advances in multi-modal foundation models, previously text-only large language models (LLMs) have evolved to incorporate visual input, opening up unprecedented opportunities for applications in visualization. Our work explores using the visual perception ability of multi-modal LLMs to develop Autonomous Visualization Agents (AVAs) that can interpret and accomplish user-defined visualization objectives expressed in natural language. We propose the first framework for the design of AVAs and present several usage scenarios that demonstrate the general applicability of the proposed paradigm. The addition of visual perception allows AVAs to act as virtual visualization assistants for domain experts who may lack the knowledge or expertise to fine-tune visualization outputs. Our preliminary exploration and proof-of-concept agents suggest that this approach is widely applicable whenever choosing appropriate visualization parameters requires interpreting previous visual output. We incorporate feedback from unstructured interviews with experts in AI research, medical visualization, and radiology, which highlights the practicality and potential of AVAs. Our study indicates that AVAs represent a general paradigm for designing intelligent visualization systems that can achieve high-level visualization goals, paving the way for expert-level visualization agents in the future.