In some fields, scientific data formats differ across experiments because each experiment relies on specialized hardware and data acquisition systems. Researchers must develop, document, and maintain experiment-specific analysis software to read these formats, and this software is often tightly coupled to one particular format. The resulting proliferation of custom data formats has been a prominent challenge for small to mid-scale experiments. The widespread adoption of ROOT has largely mitigated this problem for the Large Hadron Collider experiments, but many smaller experiments continue to use custom data formats to meet specific research needs. Simplifying access to such formats for analysis is therefore of considerable value to scientific communities within high-energy physics (HEP). For this purpose, we have added Awkward Arrays as a target language for Kaitai Struct. Researchers describe their custom data format in the Kaitai Struct YAML (KSY) language. From the KSY description, the Kaitai Struct Compiler generates C++ code that fills Awkward Array LayoutBuilder buffers. In a few steps, the Kaitai Struct Awkward Runtime API turns the generated C++ code into a compiled Python module. Finally, the raw data is passed to that module to produce Awkward Arrays. This paper introduces the Awkward target for the Kaitai Struct Compiler and the Kaitai Struct Awkward Runtime API, and demonstrates the conversion of a KSY description of a specific custom file format into Awkward Arrays.
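To make the end product of this pipeline concrete, the following is a minimal sketch in pure Python of what "filling buffers" from a custom binary format means. It is not the code generated by the Kaitai Struct Compiler and does not use the Kaitai Struct Awkward Runtime API. The binary layout (an event count followed by variable-length lists of hits) and the helper names `make_sample` and `parse_to_buffers` are hypothetical, chosen only for illustration. The sketch shows the columnar idea behind a variable-length list in Awkward Array: one flat `content` buffer plus an `offsets` buffer marking where each list starts and ends.

```python
import struct


def make_sample(events):
    """Serialize events in a hypothetical little-endian format:
    uint32 event count, then for each event a uint16 hit count
    followed by that many float32 values."""
    out = bytearray(struct.pack("<I", len(events)))
    for hits in events:
        out += struct.pack("<H", len(hits))
        out += struct.pack(f"<{len(hits)}f", *hits)
    return bytes(out)


def parse_to_buffers(raw):
    """Walk the raw bytes once and fill two columnar buffers:
    `content` holds all hit values back to back, and `offsets`
    records the boundaries of each event's list of hits."""
    (n_events,) = struct.unpack_from("<I", raw, 0)
    pos = 4
    offsets = [0]
    content = []
    for _ in range(n_events):
        (n_hits,) = struct.unpack_from("<H", raw, pos)
        pos += 2
        content.extend(struct.unpack_from(f"<{n_hits}f", raw, pos))
        pos += 4 * n_hits
        offsets.append(len(content))
    return offsets, content


# Three events with 2, 0, and 1 hits respectively.
raw = make_sample([[1.5, 2.5], [], [3.0]])
offsets, content = parse_to_buffers(raw)

# Event i corresponds to content[offsets[i]:offsets[i + 1]].
events = [content[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]
```

In the workflow described above, a hand-written parser of this kind is what the KSY description replaces: the format is declared once, and the buffer-filling code is generated rather than maintained by hand.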