Despite the popularity and widespread use of semi-structured data formats such as JSON, end-to-end supervised learning applied directly to such data remains underexplored. We present ORIGAMI (Object RepresentatIon via Generative Autoregressive ModellIng), a transformer-based architecture that directly processes nested key/value pairs while preserving their hierarchical semantics. Our key technical contributions include: (1) a structure-preserving tokenizer, (2) a novel key/value position encoding scheme, and (3) a grammar-constrained training and inference framework that guarantees valid outputs and accelerates training convergence. Together, these components enable efficient end-to-end modeling of semi-structured data. By reformulating classification as next-token prediction, ORIGAMI naturally handles both single-label and multi-label tasks without architectural modifications. Empirical evaluation across diverse domains demonstrates ORIGAMI's effectiveness: on standard tabular benchmarks converted to JSON, ORIGAMI remains competitive with classical and state-of-the-art approaches. On native JSON datasets, we outperform baselines on multi-label classification, and on a code classification task we outperform specialized models such as convolutional and graph neural networks. Through extensive ablation studies, we validate the impact of each architectural component and establish ORIGAMI as a robust framework for end-to-end learning on semi-structured data.
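To illustrate the idea behind grammar-constrained inference, the following is a minimal sketch, not the paper's actual implementation: a toy grammar over structural tokens for flat key/value objects, with a decoding step that masks out tokens the grammar forbids. The vocabulary, grammar rules, and function names here are illustrative assumptions.

```python
# Toy vocabulary of structural tokens for flat key/value objects
# (illustrative assumption; ORIGAMI's real tokenizer is richer).
VOCAB = ["{", "}", "KEY", "VALUE"]

def allowed_next(prev):
    """Return the set of tokens the toy grammar permits after `prev`.

    Grammar: an object is "{" (KEY VALUE)* "}"; None marks sequence start.
    """
    if prev is None:
        return {"{"}
    if prev == "{":
        return {"KEY", "}"}
    if prev == "KEY":
        return {"VALUE"}
    if prev == "VALUE":
        return {"KEY", "}"}
    return set()  # "}" closes the object; nothing may follow

def constrained_argmax(logits, prev):
    """Pick the highest-scoring token among those the grammar allows."""
    allowed = allowed_next(prev)
    best, best_score = None, float("-inf")
    for tok, score in zip(VOCAB, logits):
        if tok in allowed and score > best_score:
            best, best_score = tok, score
    return best

# Even if the model scores "}" highest right after a KEY, the constraint
# forces a VALUE, so every decoded sequence is a well-formed object.
logits_after_key = [0.1, 0.9, 0.2, 0.5]  # "}" scores highest
print(constrained_argmax(logits_after_key, "KEY"))  # VALUE
```

Restricting the argmax to grammatically valid tokens guarantees well-formed output by construction, and during training the same masking removes impossible continuations from the loss, which is one way such constraints can speed up convergence.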