Triplet Feature Fusion for Equipment Anomaly Prediction: An Open-Source Methodology Using Small Foundation Models

Predicting equipment anomalies before they escalate into failures is a critical challenge in industrial facility management. Existing approaches rely either on hand-crafted threshold rules, which generalize poorly, or on large neural models that are impractical for on-site, air-gapped deployments. We present an industrial methodology that resolves this tension by combining open-source small foundation models into a unified 1,116-dimensional Triplet Feature Fusion pipeline. The pipeline integrates: (1) statistical features ($x \in \mathbb{R}^{28}$) derived from 90-day sensor histories, (2) time-series embeddings ($y \in \mathbb{R}^{64}$) from a LoRA-adapted IBM Granite TinyTimeMixer (TTM, 133K parameters), and (3) multilingual text embeddings ($z \in \mathbb{R}^{1024}$) extracted from Japanese equipment master records via multilingual-e5-large. The concatenated triplet $h = [x; y; z]$ is passed to a LightGBM classifier (< 3 MB) trained to predict anomalies at 30-, 60-, and 90-day horizons. All components use permissive open-source licenses (Apache 2.0 / MIT). The inference-time pipeline runs entirely on CPU in under 2 ms, enabling edge deployment on co-located hardware without cloud dependency. On a dataset of 64 HVAC units comprising 67,045 samples, the triplet model achieves Precision = 0.992, F1 = 0.958, and ROC-AUC = 0.998 at the 30-day horizon. Crucially, it reduces the False Positive Rate from 0.6 percent (baseline) to 0.1 percent, an 83 percent reduction attributable to equipment-type conditioning via the text embedding $z$. Cluster analysis reveals that the embeddings align time-series signatures with distinct fault archetypes, explaining how compact multilingual representations improve discrimination without explicit categorical encoding.
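The fusion step and the reported false-positive reduction can be illustrated with a minimal sketch. It assumes only the dimensions and rates stated in the abstract. The function names `fuse_triplet` and `relative_reduction` are hypothetical, and the random vectors are placeholders standing in for the real statistical features, TTM embeddings, and multilingual-e5-large embeddings. It is not the authors' implementation.

```python
import random

# Feature-block dimensions as stated in the abstract.
DIM_STAT, DIM_TS, DIM_TEXT = 28, 64, 1024


def fuse_triplet(x, y, z):
    """Concatenate the three feature blocks into h = [x; y; z]."""
    if (len(x), len(y), len(z)) != (DIM_STAT, DIM_TS, DIM_TEXT):
        raise ValueError("unexpected feature dimensions")
    return list(x) + list(y) + list(z)


def relative_reduction(baseline, improved):
    """Fractional drop from baseline to improved (e.g. a false positive rate)."""
    return (baseline - improved) / baseline


# Placeholder vectors (hypothetical data, not real sensor or text features).
rng = random.Random(0)
x = [rng.random() for _ in range(DIM_STAT)]   # statistical features
y = [rng.random() for _ in range(DIM_TS)]     # time-series embedding
z = [rng.random() for _ in range(DIM_TEXT)]   # text embedding

h = fuse_triplet(x, y, z)
print(len(h))  # 28 + 64 + 1024 = 1116

# FPR drop from 0.6 percent to 0.1 percent, as a percentage of the baseline.
print(round(relative_reduction(0.006, 0.001) * 100))  # 83
```

In a real pipeline, the fused vector `h` would be one row of the feature matrix given to the gradient-boosted classifier. The dimension check is a cheap guard against silently misaligned feature blocks.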
