Towards Observation Lakehouses: Living, Interactive Archives of Software Behavior

From arXiv: 5 pages, 2 tables, 1 figure. Accepted at the IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER 2026 Tool Demo).

Code-generating LLMs are trained largely on static artifacts (source, comments, specifications) and rarely on materializations of run-time behavior. As a result, they readily internalize buggy or mislabeled code. Since non-trivial semantic properties are undecidable in general, the only practical way to obtain ground-truth functionality is by dynamic observation of executions. In prior work, we addressed representation with Sequence Sheets, Stimulus-Response Matrices (SRMs), and Stimulus-Response Cubes (SRCs) to capture and compare behavior across tests, implementations, and contexts. These structures make observation data analyzable offline and reusable, but they do not by themselves provide persistence, evolution, or interactive analytics at scale. In this paper, therefore, we introduce observation lakehouses that operationalize continual SRCs: a tall, append-only observations table storing every actuation (stimulus, response, context) and SQL queries that materialize SRC slices on demand. Built on Apache Parquet + Iceberg + DuckDB, the lakehouse ingests data from controlled pipelines (LASSO) and CI pipelines (e.g., unit test executions), enabling n-version assessment, behavioral clustering, and consensus oracles without re-execution. On a 509-problem benchmark, we ingest $\approx$8.6M observation rows ($<$51MiB) and reconstruct SRM/SRC views and clusters in $<$100ms on a laptop, demonstrating that continual behavior mining is practical without a distributed cluster of machines. This makes behavioral ground truth first-class alongside other run-time data and provides an infrastructure path toward behavior-aware evaluation and training. The Observation Lakehouse, together with the accompanying dataset, is publicly available as an open-source project on GitHub: https://github.com/SoftwareObservatorium/observation-lakehouse
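
The tall, append-only observations table and the on-demand SRM/SRC slices described above can be illustrated with a minimal sketch. It is not the project's actual schema or query set: the table layout, column names, and sample rows are hypothetical, and Python's built-in `sqlite3` stands in for DuckDB over Parquet/Iceberg so the example runs without extra dependencies. The functions `srm`, `clusters`, and `consensus_oracle` are illustrative names, not part of LASSO or the lakehouse.

```python
import sqlite3
from collections import Counter, defaultdict

# Hypothetical schema: one row per actuation (stimulus, response, context).
# sqlite3 is a stand-in here; the paper's stack is Parquet + Iceberg + DuckDB.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE observations (
        problem  TEXT,  -- benchmark problem id
        impl     TEXT,  -- implementation under observation
        test     TEXT,  -- test id
        context  TEXT,  -- execution context (e.g. environment label)
        stimulus TEXT,  -- serialized invocation
        response TEXT   -- serialized observed output
    )
""")

# Append-only ingestion: rows are only ever inserted, never updated.
rows = [
    ("p1", "impl_a", "t1", "ctx1", "abs(-2)", "2"),
    ("p1", "impl_b", "t1", "ctx1", "abs(-2)", "2"),
    ("p1", "impl_c", "t1", "ctx1", "abs(-2)", "-2"),
    ("p1", "impl_a", "t2", "ctx1", "abs(3)", "3"),
    ("p1", "impl_b", "t2", "ctx1", "abs(3)", "3"),
    ("p1", "impl_c", "t2", "ctx1", "abs(3)", "3"),
]
conn.executemany("INSERT INTO observations VALUES (?, ?, ?, ?, ?, ?)", rows)


def srm(conn, problem, context):
    """Materialize one SRM slice as {test: {impl: response}}."""
    query = (
        "SELECT test, impl, response FROM observations "
        "WHERE problem = ? AND context = ? ORDER BY test, impl"
    )
    matrix = defaultdict(dict)
    for test, impl, response in conn.execute(query, (problem, context)):
        matrix[test][impl] = response
    return dict(matrix)


def clusters(matrix):
    """Group implementations whose response vectors are identical."""
    tests = sorted(matrix)
    impls = sorted({impl for column in matrix.values() for impl in column})
    groups = defaultdict(list)
    for impl in impls:
        signature = tuple(matrix[t].get(impl) for t in tests)
        groups[signature].append(impl)
    return sorted(groups.values())


def consensus_oracle(matrix):
    """Pick the most frequent response per test as the expected value."""
    return {
        test: Counter(column.values()).most_common(1)[0][0]
        for test, column in matrix.items()
    }


matrix = srm(conn, "p1", "ctx1")
behavior_clusters = clusters(matrix)
oracle = consensus_oracle(matrix)
```

In this toy data, `impl_a` and `impl_b` fall into one behavioral cluster while `impl_c` diverges on `t1`, and the majority vote supplies an expected response for each test. Nothing is re-executed: every view is derived from stored rows, which is the property the abstract emphasizes.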
