Evaluating language models and AI agents remains fundamentally challenging because static benchmarks fail to capture real-world uncertainty, distribution shift, and the gap between isolated task accuracy and human-aligned decision-making under evolving conditions. This paper introduces TruthTensor, a novel, reproducible evaluation paradigm that measures reasoning models not only as prediction engines but as human-imitation systems operating in socially grounded, high-entropy environments. Building on forward-looking, contamination-free tasks, our framework anchors evaluation to live prediction markets and applies probabilistic scoring to provide a holistic view of model behavior. TruthTensor complements traditional correctness metrics with drift-centric diagnostics and explicit robustness checks for reproducibility. It specifies human vs. automated evaluation roles, annotation protocols, and statistical testing procedures to ensure interpretability and replicability of results. In experiments across 500+ real markets (political, economic, cultural, and technological), TruthTensor demonstrates that models with similar forecast accuracy can diverge markedly in calibration, drift, and risk sensitivity, underscoring the need to evaluate models along multiple axes (accuracy, calibration, narrative stability, cost, and resource efficiency). TruthTensor therefore operationalizes modern evaluation best practices (clear hypothesis framing, careful metric selection, transparent compute/cost reporting, human-in-the-loop validation, and open, versioned evaluation contracts) to produce defensible assessments of LLMs in real-world decision contexts. We publicly release TruthTensor at https://truthtensor.com.
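To make the claim about accuracy-calibration divergence concrete, the sketch below shows one plausible way forecasts against resolved markets can be scored probabilistically, assuming a Brier score and a simple binned expected calibration error. The function names, data, and binning scheme are illustrative assumptions, not the scoring defined in the TruthTensor release.

```python
# Hypothetical sketch, not TruthTensor's actual scoring code: scores
# probabilistic forecasts against 0/1 market resolutions with a Brier
# score and a binned expected calibration error (ECE).
from typing import Sequence

def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 resolutions."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def expected_calibration_error(probs: Sequence[float],
                               outcomes: Sequence[int],
                               n_bins: int = 10) -> float:
    """Bin-weighted average of |mean confidence - empirical frequency|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_p = sum(p for p, _ in bucket) / len(bucket)
        freq = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_p - freq)
    return ece

# Two models with identical thresholded accuracy (all four calls correct
# at p > 0.5) can still diverge sharply under probabilistic scoring:
forecasts_a = [0.9, 0.8, 0.2, 0.1]    # confident, well-calibrated forecasts
forecasts_b = [0.6, 0.55, 0.45, 0.4]  # hedged toward 0.5
resolved = [1, 1, 0, 0]               # market resolutions

print(brier_score(forecasts_a, resolved))  # 0.025
print(brier_score(forecasts_b, resolved))  # 0.18125
```

The toy numbers illustrate the abstract's point: both forecast sets yield the same accuracy once thresholded, yet their probabilistic scores differ by nearly an order of magnitude, which is exactly the kind of divergence a market-anchored, calibration-aware evaluation is designed to surface.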