Recent empirical observations suggest that dataset content and training strategy may matter more for the accuracy of a language model than the number of parameters, the training duration, or the dataset size. Following this argument, we used low-rank adaptation to fine-tune a pretrained, general-purpose causal language model with one billion parameters on a dataset curated from team statistics for the first ten game weeks of the Italian football league. The small training dataset was compiled with a framework in which a powerful commercial large language model generates distilled paragraphs and question-answer pairs. Training was kept relatively short to establish a baseline for our exploration of a minimal setting. In this article, we share our key observations on developing a special-purpose language model for interpreting football (soccer) data under constrained resources.