Modern analyst agents must reason over complex, high-token inputs, including dozens of retrieved documents, tool outputs, and time-sensitive data. While prior work has produced tool-calling benchmarks and examined factuality in knowledge-augmented systems, relatively little work studies their intersection: settings where LLMs must integrate large volumes of dynamic, structured and unstructured multi-tool outputs. We investigate LLM failure modes in this regime using crypto as a representative high-data-density domain. We introduce (1) CryptoAnalystBench, an analyst-aligned benchmark of 198 production crypto and DeFi queries spanning 11 categories; (2) an agentic harness equipped with relevant crypto and DeFi tools to generate responses across multiple frontier LLMs; and (3) an evaluation pipeline with citation verification and an LLM-as-a-judge rubric spanning four user-defined success dimensions: relevance, temporal relevance, depth, and data consistency. Using human annotation, we develop a taxonomy of seven higher-order error types that are not reliably captured by factuality checks or LLM-based quality scoring. We find that these failures persist even in state-of-the-art systems and can compromise high-stakes decisions. Based on this taxonomy, we refine the judge rubric to better capture these errors. While the judge does not align with human annotators on precise scoring across rubric iterations, it reliably identifies critical failure modes, enabling scalable feedback for developers and researchers studying analyst-style agents. We release CryptoAnalystBench with annotated queries, the evaluation pipeline, judge rubrics, and the error taxonomy, and outline mitigation strategies and open challenges in evaluating long-form, multi-tool-augmented systems.