Large Language Model agents increasingly operate external systems through programmatic interfaces, yet practitioners lack empirical guidance on how to structure the context these agents consume. Using SQL generation as a proxy for programmatic agent operations, we present a systematic study of context engineering for structured data, comprising 9,649 experiments across 11 models, 4 formats (YAML, Markdown, JSON, Token-Oriented Object Notation [TOON]), and schemas ranging from 10 to 10,000 tables. Our findings challenge common assumptions. First, architecture choice is model-dependent: file-based context retrieval improves accuracy for frontier-tier models (Claude, GPT, Gemini; +2.7%, p=0.029) but shows mixed results for open-source models (aggregate -7.7%, p<0.001), with deficits varying substantially by model. Second, format does not significantly affect aggregate accuracy (chi-squared=2.45, p=0.484), though individual models, particularly open-source ones, exhibit format-specific sensitivities. Third, model capability is the dominant factor: a 21-percentage-point accuracy gap between the frontier and open-source tiers dwarfs any format or architecture effect. Fourth, file-native agents scale to 10,000 tables through domain-partitioned schemas while maintaining high navigation accuracy. Fifth, file size does not predict runtime efficiency: compact or novel formats can incur token overhead driven by grep output density and pattern unfamiliarity, with the magnitude depending on model capability. These findings give practitioners evidence-based guidance for deploying LLM agents on structured systems, demonstrating that architectural decisions should be tailored to model capability rather than assuming universal best practices.