Multimodal Large Language Models (MLLMs) enhance collaboration in Extended Reality (XR) environments by letting users create objects and animations flexibly from a combination of natural language and visual inputs. However, visual data captured by XR headsets includes real-world backgrounds that may contain irrelevant or sensitive user information, such as credit cards left on a table or the facial identities of other users. Uploading these frames to cloud-based MLLMs poses serious privacy risks, particularly when the data is processed without explicit user consent. In addition, the colocation and synchronization mechanisms in commercial XR APIs rely on time-consuming, privacy-invasive environment scanning and struggle to adapt to the highly dynamic nature of MLLM-integrated XR environments. In this paper, we propose PRISM-XR, a novel framework that facilitates multi-user collaboration in XR through privacy-aware MLLM integration. PRISM-XR performs intelligent frame preprocessing on the edge server to filter sensitive data and remove irrelevant context before communicating with cloud generative AI models. We also introduce a lightweight registration process and a fully customizable content-sharing mechanism that enable efficient, accurate, and privacy-preserving content synchronization among users. Our numerical evaluation indicates that the proposed platform achieves nearly 90% accuracy in fulfilling user requests and a registration time below 0.27 seconds, while keeping spatial inconsistencies below 3.5 cm. Furthermore, we conducted an IRB-approved user study with 28 participants, which showed that our system automatically filters highly sensitive objects in over 90% of scenarios while maintaining strong overall usability.
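To make the idea of edge-side frame preprocessing concrete, the following is a minimal sketch of one possible filtering step: overwriting regions flagged as sensitive before a frame leaves the edge server. It is not PRISM-XR's actual pipeline, which the abstract does not specify. The `Box` type, the `redact` function, and the assumption that some detector has already produced bounding boxes are all hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Frame = List[List[Pixel]]  # frame[row][col] -> (r, g, b)


@dataclass(frozen=True)
class Box:
    """Axis-aligned region flagged as sensitive (hypothetical detector output)."""
    left: int
    top: int
    right: int   # exclusive
    bottom: int  # exclusive


def redact(frame: Frame, boxes: List[Box], fill: Pixel = (0, 0, 0)) -> Frame:
    """Return a copy of `frame` with every pixel inside any box overwritten."""
    height = len(frame)
    width = len(frame[0]) if height else 0
    out = [list(row) for row in frame]  # copy rows so the input stays untouched
    for box in boxes:
        # Clamp to the frame so out-of-range boxes cannot raise IndexError.
        r0, r1 = max(0, box.top), min(height, box.bottom)
        c0, c1 = max(0, box.left), min(width, box.right)
        for r in range(r0, r1):
            for c in range(c0, c1):
                out[r][c] = fill
    return out


# Usage: a 4x4 white frame with one flagged 2x2 region in the middle.
white = (255, 255, 255)
frame = [[white] * 4 for _ in range(4)]
safe = redact(frame, [Box(left=1, top=1, right=3, bottom=3)])
```

Returning a copy rather than mutating the input keeps the original frame available on the edge server for local rendering, while only the redacted copy would be forwarded to a cloud model. A real system would presumably work on image arrays and use a trained detector to produce the boxes.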