The tensor is the most basic and essential data structure in today's artificial intelligence (AI) systems. Its natural properties, especially memory contiguity and slice independence, allow training systems to leverage parallel computing units such as GPUs to process data simultaneously along the batch, spatial, or temporal dimensions. However, beyond perception tasks, the data in a complicated cognitive AI system usually has a hierarchical structure (i.e., nested data) spanning various modalities. Such data is inconvenient and inefficient to program directly with conventional fixed-shape tensors. To address this issue, we summarize the two main computational patterns of nested data and then propose a general nested data container: TreeTensor. Through the various constraints and magic utilities of TreeTensor, one can apply arbitrary functions and operations to nested data at almost zero cost, including those from popular machine learning libraries such as Scikit-Learn, NumPy, and PyTorch. Our approach uses a constrained tree-structure perspective to model data relationships systematically, and it can easily be combined with other methods to support further uses, such as asynchronous execution and variable-length data computation. Detailed examples and benchmarks show that TreeTensor not only offers powerful usability across various problems, notably in one of the most complicated AI systems at present, AlphaStar for StarCraft II, but also exhibits excellent runtime efficiency without any overhead. Our project is available at https://github.com/opendilab/DI-treetensor.
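To make the idea of applying one function across nested data concrete, the following is a minimal sketch in plain Python. It is not the DI-treetensor API: the helper `tree_map` and the observation `obs` are hypothetical names introduced only for illustration, and plain lists stand in for tensors so that the example has no dependencies.

```python
from typing import Any, Callable


def tree_map(func: Callable[[Any], Any], tree: Any) -> Any:
    """Apply func to every leaf of a nested dict, preserving the tree structure.

    Hypothetical helper for illustration; not part of DI-treetensor.
    """
    if isinstance(tree, dict):
        # Internal node: recurse into each child and keep the same keys.
        return {key: tree_map(func, sub) for key, sub in tree.items()}
    # Leaf node: apply the function directly.
    return func(tree)


# A hypothetical nested observation with leaves of different lengths.
obs = {
    "scalar": [1.0, 2.0],
    "entity": {"position": [0.5, 1.5, 2.5], "health": [10.0]},
}

# One call transforms every leaf; the caller never walks the tree by hand.
doubled = tree_map(lambda leaf: [2 * x for x in leaf], obs)

# The same helper can also collect per-leaf metadata, here the leaf lengths.
lengths = tree_map(len, obs)
```

The sketch rebuilds the tree on every call and leaves the input untouched. A container such as the one described above would wrap this pattern so that ordinary operators and library functions can be applied to the whole tree directly.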