OpenTSLM: Time-Series Language Models for Reasoning over Multivariate Medical Text- and Time-Series Data

Patrick Langer, Thomas Kaar, Max Rosenblattl, Maxwell A. Xu, Winnie Chow, Martin Maritsch, Robert Jakob, Ning Wang, Juncheng Liu, Aradhana Verma, Brian Han, Daniel Seung Kim, Henry Chubb, Scott Ceresnak, Aydin Zahedivash, Alexander Tarlochan Singh Sandhu, Fatima Rodriguez, Daniel McDuff, Elgar Fleisch, Oliver Aalami, Filipe Barata, Paul Schmiedmayer

Large language models (LLMs) have emerged as powerful tools for interpreting multimodal data. In medicine, they hold particular promise for synthesizing large volumes of clinical information into actionable insights and digital health applications. Yet a major limitation remains their inability to handle time series. To close this gap, we present OpenTSLM, a family of Time Series Language Models (TSLMs) created by integrating time series as a native modality into pretrained LLMs, enabling reasoning over multiple time series of any length. We investigate two architectures for OpenTSLM. The first, OpenTSLM-SoftPrompt, models time series implicitly by concatenating learnable time series tokens with text tokens via soft prompting. Although this approach is parameter-efficient, we hypothesize that explicit time series modeling scales better and outperforms implicit approaches. We thus introduce OpenTSLM-Flamingo, which integrates time series with text via cross-attention. We benchmark both variants against baselines that treat time series as text tokens or plots, across a suite of text-time-series Chain-of-Thought (CoT) reasoning tasks. We introduce three datasets: HAR-CoT, Sleep-CoT, and ECG-QA-CoT. Across all three, OpenTSLM models outperform baselines, reaching 69.9 F1 in sleep staging and 65.4 F1 in HAR, compared to 9.05 and 52.2 for finetuned text-only models. Notably, even 1B-parameter OpenTSLM models surpass GPT-4o (15.47 and 2.95, respectively). OpenTSLM-Flamingo matches OpenTSLM-SoftPrompt in performance and outperforms it on longer sequences, while maintaining stable memory requirements. By contrast, the memory requirements of SoftPrompt grow exponentially with sequence length: training on ECG-QA with LLaMA-3B requires around 110 GB of VRAM with SoftPrompt, compared to 40 GB with Flamingo. In expert reviews, clinicians find that OpenTSLMs exhibit strong reasoning capabilities on ECG-QA. To facilitate further research, we release all code, datasets, and models as open source.
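The structural difference between the two architectures can be illustrated by counting the tokens that the LLM itself has to process. The sketch below is only an illustration under stated assumptions, not the OpenTSLM implementation: the function names, the fixed-size patching scheme, the patch size, and the token counts are all hypothetical. It assumes that soft prompting places every time series token in the LLM's input sequence, whereas cross-attention keeps the LLM's sequence at the length of the text and exposes the time series only as keys and values. It does not attempt to reproduce the memory figures reported above.

```python
from math import ceil


def num_patch_tokens(series_length: int, patch_size: int) -> int:
    """Tokens produced by splitting one series into fixed-size patches.

    Assumption: the last, possibly incomplete patch is padded to full size.
    """
    return ceil(series_length / patch_size)


def softprompt_llm_sequence_length(text_tokens, series_lengths, patch_size):
    """Soft prompting: time series tokens are concatenated with text tokens,
    so the LLM's input sequence grows with every additional sample."""
    ts_tokens = sum(num_patch_tokens(n, patch_size) for n in series_lengths)
    return text_tokens + ts_tokens


def cross_attention_llm_sequence_length(text_tokens, series_lengths, patch_size):
    """Cross-attention: the LLM's input sequence holds only text tokens.

    The time series enter as keys/values of cross-attention layers, so
    series_lengths and patch_size do not affect the returned length.
    """
    return text_tokens


if __name__ == "__main__":
    # Hypothetical setting: 200 text tokens, 3 series, patch size 4.
    for length in (1_000, 10_000):
        series = [length] * 3
        soft = softprompt_llm_sequence_length(200, series, 4)
        cross = cross_attention_llm_sequence_length(200, series, 4)
        print(f"series length {length}: soft prompt {soft}, cross-attention {cross}")
```

In this toy setting, a tenfold increase in series length raises the soft-prompt sequence from 950 to 7,700 tokens, while the cross-attention sequence stays at 200. This is consistent in spirit with the stable memory requirements described for OpenTSLM-Flamingo.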
