Histogramming is often taken for granted, but the power and compactness of partially aggregated, multidimensional summary statistics, together with their fundamental connection to differential and integral calculus, make histograms formidable statistical objects, especially when very large data volumes are involved. Expressing these concepts robustly and efficiently in high-dimensional parameter spaces and for large data samples is, however, a highly non-trivial challenge, doubly so if the resulting library is to remain usable by scientists rather than software engineers. In this paper we summarise the core principles required for consistent generalised histogramming, and use them to motivate the design principles and implementation mechanics of the re-engineered YODA histogramming library, a key component of data-model comparison and statistical interpretation in collider physics.