Open-ended tasks are particularly challenging for LLMs: their vast solution space demands both expansive exploration and adaptable strategies, especially when success lacks a clear, objective definition. Writing, with its enormous range of valid outputs and subjective evaluation criteria, provides a compelling testbed for studying such problems. In this paper, we investigate the potential of LLMs to act as collaborative co-writers, capable of autonomously suggesting and implementing text improvements. We analyse three prominent LLMs (Gemini 1.5 Pro, Claude 3.5 Sonnet, and GPT-4o), focusing on how their action diversity, human alignment, and iterative improvement capabilities affect overall performance. This work establishes a framework for benchmarking autonomous writing agents and, more broadly, highlights fundamental challenges and potential solutions for building systems capable of excelling in diverse open-ended domains.