Large Language Model agents increasingly operate external systems through programmatic interfaces, yet practitioners lack empirical guidance on how to structure the context these agents consume. Using SQL generation as a proxy for programmatic agent operations, we present a systematic study of context engineering for structured data, comprising 9,649 experiments across 11 models, 4 formats (YAML, Markdown, JSON, Token-Oriented Object Notation [TOON]), and schemas ranging from 10 to 10,000 tables. Our findings challenge common assumptions. First, architecture choice is model-dependent: file-based context retrieval improves accuracy for frontier-tier models (Claude, GPT, Gemini; +2.7%, p=0.029) but shows mixed results for open-source models (aggregate -7.7%, p<0.001), with deficits varying substantially by model. Second, format does not significantly affect aggregate accuracy (chi-squared=2.45, p=0.484), though individual models, particularly open-source ones, exhibit format-specific sensitivities. Third, model capability is the dominant factor: a 21-percentage-point accuracy gap between the frontier and open-source tiers dwarfs any format or architecture effect. Fourth, file-native agents scale to 10,000 tables through domain-partitioned schemas while maintaining high navigation accuracy. Fifth, file size does not predict runtime efficiency: compact or novel formats can incur token overhead driven by grep output density and pattern unfamiliarity, with the magnitude depending on model capability. These findings give practitioners evidence-based guidance for deploying LLM agents on structured systems, demonstrating that architectural decisions should be tailored to model capability rather than assuming universal best practices.