AI for HPC Operations (2024 - Present)

Sep 1, 2024 · 1 min read

Analytics and AI Methods at Scale Group (AAIMS), Oak Ridge National Laboratory, USA
Research Staff, September 2024 - Present

Overview

Bringing AI into HPC operations requires translating unstructured machine data into structured insights. This project explores how LLMs can automate data science workflows, reducing the dependency on domain-specific expertise.

Built an LLM-based query system that lets operators interact with HPC telemetry using natural language.
Designed a self-adaptive AutoML system for operational forecasting, reducing the barrier for predictive analytics in HPC centers.
Established an autonomous model registry and retraining pipeline for continuously evolving operational intelligence.

Technology

Python frameworks: LangChain, Chainlit, Pydantic, ChromaDB, LiteLLM, Hugging Face
Retrieval-Augmented Generation (RAG) ingest pipeline and context
PyTorch-based predictive ANN
AutoML frameworks (AutoGluon Tabular and TimeSeries)
A combination of local LLMs, such as Llama 3.1 and 3.2, and large OpenAI models
Context-aware, tool-call-driven hierarchical LLM chain structure
DuckDB-based SQL interface to Parquet-based telemetry data
MLflow-based experiment and model tracking
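
The tool-call pattern over a SQL telemetry interface can be illustrated with a minimal sketch. This is not the project's implementation: Python's built-in sqlite3 stands in for DuckDB so the example runs without extra dependencies, a keyword-based router stands in for the LLM that would choose the tool, and the table, column, and tool names (telemetry, power_w, avg_power, peak_power) are all hypothetical.

```python
import re
import sqlite3

# Hypothetical telemetry table. The project queries Parquet files through
# DuckDB; an in-memory SQLite database stands in for that here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE telemetry (node TEXT, ts INTEGER, power_w REAL)")
conn.executemany(
    "INSERT INTO telemetry VALUES (?, ?, ?)",
    [("n01", 0, 410.0), ("n01", 60, 430.0),
     ("n02", 0, 380.0), ("n02", 60, 400.0)],
)


def avg_power(node: str) -> float:
    """Tool: average power draw (watts) for one node."""
    row = conn.execute(
        "SELECT AVG(power_w) FROM telemetry WHERE node = ?", (node,)
    ).fetchone()
    return row[0]


def peak_power(node: str) -> float:
    """Tool: peak power draw (watts) for one node."""
    row = conn.execute(
        "SELECT MAX(power_w) FROM telemetry WHERE node = ?", (node,)
    ).fetchone()
    return row[0]


# Registry of callable tools, keyed by the name used in a tool call.
TOOLS = {"avg_power": avg_power, "peak_power": peak_power}


def route(question: str) -> dict:
    """Stand-in for the LLM: turn a question into a structured tool call.

    A real system would ask the model to emit this dictionary; simple
    keyword matching is used here only to keep the sketch self-contained.
    """
    match = re.search(r"\bn\d+\b", question)
    if match is None:
        raise ValueError("no node name found in question")
    tool = "peak_power" if "peak" in question.lower() else "avg_power"
    return {"tool": tool, "args": {"node": match.group(0)}}


def answer(question: str) -> float:
    """Dispatch the structured tool call to the matching SQL-backed tool."""
    call = route(question)
    return TOOLS[call["tool"]](**call["args"])
```

Calling answer("What is the average power of n01?") routes to avg_power and runs a parameterized SQL query. Keeping the SQL inside fixed tools, rather than letting the model write free-form queries, is one way to constrain what a natural language request can execute.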

Publications

Publication preparation targeting SC25 (Work in progress)

Last updated on Sep 1, 2024

Applications of AI/ML

Authors

Woong Shin

HPC Systems/Software/Data Engineer, Computer Systems Researcher
