Data discovery and preparation remain persistent bottlenecks in the data management lifecycle, especially when user intent is vague, evolving, or difficult to operationalize. The Pneuma Project introduces Pneuma-Seeker, a system that helps users articulate and fulfill information needs through iterative interaction with a language model-powered platform. The system reifies the user's evolving information need as a relational data model and incrementally converges toward a usable document aligned with that intent. To achieve this, the system combines three architectural ideas: context specialization to reduce LLM burden across subtasks, a conductor-style planner to assemble dynamic execution plans, and a convergence mechanism based on shared state. The system integrates recent advances in retrieval-augmented generation (RAG), agentic frameworks, and structured data preparation to support semi-automatic, language-guided workflows. We evaluate the system through LLM-based user simulations and show that it helps surface latent intent, guide discovery, and produce fit-for-purpose documents. It also acts as an emergent documentation layer, capturing institutional knowledge and supporting organizational memory.
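To make the three architectural ideas more concrete, the following is a minimal illustrative sketch, not the Pneuma-Seeker implementation. It assumes a toy in-memory catalog in place of a real data lake and replaces every LLM call with deterministic logic. All names (`SharedState`, `CATALOG`, `retrieve`, `prune_schema`, `conductor`, `run`) are hypothetical and introduced only for illustration. The sketch shows narrowly scoped components that each see only the slice of state they need (context specialization), a conductor that picks the next step from the current state rather than following a fixed pipeline, and a loop that stops once the shared state shows the target schema is fully covered (convergence).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set

# Toy stand-in for an indexed collection of tables (assumption, not from the paper).
CATALOG: Dict[str, List[str]] = {
    "orders": ["order_id", "customer_id", "amount"],
    "customers": ["customer_id", "region"],
}


@dataclass
class SharedState:
    """Hypothetical shared state: the information need reified as target columns."""
    target_columns: List[str]
    found_tables: Dict[str, List[str]] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)  # record of the steps taken

    def covered(self) -> Set[str]:
        return {c for cols in self.found_tables.values() for c in cols}

    def missing(self) -> List[str]:
        covered = self.covered()
        return [c for c in self.target_columns if c not in covered]

    def converged(self) -> bool:
        # Converged when every target column is supplied by a retrieved table.
        return not self.missing()


def _candidate_table(state: SharedState) -> Optional[str]:
    """Return an unretrieved table that supplies at least one missing column."""
    missing = state.missing()
    for name, cols in CATALOG.items():
        if name not in state.found_tables and any(c in cols for c in missing):
            return name
    return None


def retrieve(state: SharedState) -> None:
    """Specialized component: only looks at missing columns and the catalog."""
    name = _candidate_table(state)
    if name is not None:
        state.found_tables[name] = CATALOG[name]
        state.log.append(f"retrieve:{name}")


def prune_schema(state: SharedState) -> None:
    """Specialized component: revises the target when no table can supply a column.

    In an interactive system this revision would involve the user; here it is
    automated purely to keep the sketch self-contained.
    """
    available = {c for cols in CATALOG.values() for c in cols}
    dropped = [c for c in state.target_columns if c not in available]
    state.target_columns = [c for c in state.target_columns if c in available]
    state.log.append(f"prune:{','.join(dropped)}")


def conductor(state: SharedState) -> Optional[Callable[[SharedState], None]]:
    """Choose the next step from the current state instead of a fixed pipeline."""
    if state.converged():
        return None
    return retrieve if _candidate_table(state) is not None else prune_schema


def run(state: SharedState, max_steps: int = 10) -> SharedState:
    """Apply conductor-selected steps until convergence or the step budget runs out."""
    for _ in range(max_steps):
        step = conductor(state)
        if step is None:
            break
        step(state)
    return state
```

Starting from a target that includes a column no table supplies, `run` would first retrieve the tables that cover the satisfiable columns and then revise the target schema, at which point the state reports convergence. The `log` field is a loose analogue of the documentation role the abstract describes, since it records how the final result was reached.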