MeetBench-XL: Calibrated Multi-Dimensional Evaluation and Learned Dual-Policy Agents for Real-Time Meetings

Enterprise meeting environments require AI assistants that handle diverse operational tasks, from rapid fact-checking during live discussions to cross-meeting analysis for strategic planning, under strict latency, cost, and privacy constraints. Existing meeting benchmarks focus mainly on simplified question answering and fail to reflect real-world enterprise workflows, where queries arise organically from multi-stakeholder collaboration, span long temporal contexts, and require tool-augmented reasoning. We address this gap with a grounded dataset and a learned agent framework. First, we introduce MeetAll, a bilingual and multimodal corpus derived from 231 enterprise meetings totaling 140 hours. Questions are injected using an enterprise-informed protocol validated by domain-expert review and human discriminability studies. Unlike purely synthetic benchmarks, this protocol is grounded in four enterprise-critical dimensions: cognitive load, temporal context span, domain expertise, and actionable task execution, calibrated through interviews with stakeholders across the finance, healthcare, and technology sectors. Second, we propose MeetBench-XL, a multi-dimensional evaluation protocol aligned with human judgment that measures factual fidelity, intent alignment, response efficiency, structural clarity, and completeness. Third, we present MeetMaster-XL, a learned dual-policy agent that jointly optimizes query routing between fast and slow reasoning paths and tool invocation, including retrieval, cross-meeting aggregation, and web search. A lightweight classifier enables accurate routing with minimal overhead, achieving a better quality-latency tradeoff than single-model baselines. Experiments against commercial systems show consistent gains, supported by ablations, robustness tests, and a real-world deployment case study. Resources: https://github.com/huyuelin/MeetBench.
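To make the dual-policy idea concrete, the following is a minimal sketch of routing a query to a fast or slow path while selecting tools. It is not the MeetMaster-XL implementation: the abstract describes a learned lightweight classifier, whereas this sketch substitutes a hand-written keyword heuristic, and every name here (`complexity_score`, `route`, `RoutingDecision`, the cue lists, the 0.5 threshold) is a hypothetical chosen for illustration. Only the two paths and the three tool categories come from the abstract.

```python
from dataclasses import dataclass

# Hypothetical cue phrases. The agent described above uses a learned
# classifier; these keyword lists are only an illustrative stand-in.
CROSS_MEETING_CUES = ("previous meeting", "last week", "across meetings", "compare")
WEB_CUES = ("latest", "market", "news", "competitor")


@dataclass
class RoutingDecision:
    path: str         # "fast" or "slow"
    tools: list[str]  # subset of: retrieval, cross_meeting_aggregation, web_search


def complexity_score(query: str) -> float:
    """Toy stand-in for a lightweight classifier; returns a score in [0, 1]."""
    text = query.lower()
    # Longer queries contribute up to 0.5 (assumed scaling, not from the paper).
    score = min(len(text.split()) / 40.0, 0.5)
    if any(cue in text for cue in CROSS_MEETING_CUES):
        score += 0.4
    if any(cue in text for cue in WEB_CUES):
        score += 0.2
    return min(score, 1.0)


def route(query: str, threshold: float = 0.5) -> RoutingDecision:
    """Pick a reasoning path and a tool set for one query."""
    text = query.lower()
    tools = ["retrieval"]  # assumed default: always search the current meeting
    if any(cue in text for cue in CROSS_MEETING_CUES):
        tools.append("cross_meeting_aggregation")
    if any(cue in text for cue in WEB_CUES):
        tools.append("web_search")
    path = "slow" if complexity_score(query) >= threshold else "fast"
    return RoutingDecision(path=path, tools=tools)


if __name__ == "__main__":
    print(route("What was the budget figure just mentioned?"))
    print(route("Compare the action items from the previous meeting "
                "with today's decisions and the latest market data"))
```

In this sketch the path decision and the tool decision are computed separately, which mirrors the "dual policy" framing only loosely; a learned agent would presumably train both decisions from data rather than from fixed cues.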
