Histogramming is often taken for granted, but the power and compactness of partially aggregated, multidimensional summary statistics, together with their fundamental connection to differential and integral calculus, make them formidable statistical objects, especially when very large data volumes are involved. Expressing these concepts robustly and efficiently in high-dimensional parameter spaces and for large data samples is, however, a highly non-trivial challenge, doubly so if the resulting library is to remain usable by scientists as opposed to software engineers. In this paper we summarise the core principles required for consistent generalised histogramming, and use them to motivate the design principles and implementation mechanics of the re-engineered YODA histogramming library, a key component of physics data-model comparison and statistical interpretation in collider physics.