Evaluating language models and AI agents remains fundamentally challenging because static benchmarks fail to capture real-world uncertainty, distribution shift, and the gap between isolated task accuracy and human-aligned decision-making under evolving conditions. This paper introduces TruthTensor, a novel, reproducible evaluation paradigm that measures reasoning models not only as prediction engines but as human-imitation systems operating in socially grounded, high-entropy environments. Building on forward-looking, contamination-free tasks, our framework anchors evaluation to live prediction markets and applies probabilistic scoring to provide a holistic view of model behavior. TruthTensor complements traditional correctness metrics with drift-centric diagnostics and explicit robustness checks for reproducibility. It specifies human vs. automated evaluation roles, annotation protocols, and statistical testing procedures to ensure interpretability and replicability of results. In experiments across 500+ real markets (political, economic, cultural, technological), TruthTensor demonstrates that models with similar forecast accuracy can diverge markedly in calibration, drift, and risk sensitivity, underscoring the need to evaluate models along multiple axes: accuracy, calibration, narrative stability, cost, and resource efficiency. TruthTensor thereby operationalizes modern evaluation best practices (clear hypothesis framing, careful metric selection, transparent compute/cost reporting, human-in-the-loop validation, and open, versioned evaluation contracts) to produce defensible assessments of LLMs in real-world decision contexts. We publicly release TruthTensor at https://truthtensor.com.
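As a minimal illustration of the probabilistic scoring and calibration axes named above, the sketch below computes a Brier score and a simple binned expected calibration error for binary market forecasts. This is a hedged sketch under stated assumptions: the function names, binning scheme, and example data are illustrative, not TruthTensor's actual API or results.

```python
# Minimal sketch, assuming binary market outcomes and 10 equal-width
# probability bins. brier_score and expected_calibration_error are
# illustrative names, not TruthTensor's actual interface.
import numpy as np

def brier_score(probs: np.ndarray, outcomes: np.ndarray) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return float(np.mean((probs - outcomes) ** 2))

def expected_calibration_error(probs: np.ndarray, outcomes: np.ndarray,
                               n_bins: int = 10) -> float:
    """Weighted average gap between mean confidence and empirical frequency per bin."""
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(probs[mask].mean() - outcomes[mask].mean())
    return float(ece)

# Two models with identical accuracy at a 0.5 threshold can still
# differ in these scores, which is the divergence the abstract describes.
probs = np.array([0.9, 0.8, 0.6, 0.3, 0.2])
outcomes = np.array([1, 1, 0, 0, 0])
print(brier_score(probs, outcomes), expected_calibration_error(probs, outcomes))
```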