A useful phone agent needs to be personally intelligent. It should reason over a user's identity, history, and preferences as they exist on the device, not just follow isolated instructions in an impersonal sandbox. Existing mobile agent benchmarks lack this kind of personalization. We introduce iOSWorld, the first interactive native iOS simulator benchmark built around a persistent user identity spanning 26 newly built iOS apps. These apps contain connected data such as transactions, messages, travel records, social relationships, and financial activity. iOSWorld includes 133 tasks across three increasingly difficult categories. Single-app tasks (27) test one app, multi-app tasks (60) span 2 to 8 apps, and memory and personalization tasks (46) require agents to infer patterns from personal data. We evaluate frontier and open-source computer-use models in both vision-only and privileged vision+XML settings. The best configuration reaches 52\% overall but only 37\% on multi-app tasks. Privileged vision+XML access improves frontier models by up to 26 percentage points, while smaller models do not benefit from added accessibility-tree input. We release iOSWorld as an open-source benchmark with all apps, seeded data, tasks, rubrics, and evaluation code.