AI for HPC Operations (2024 - Present)

Sep 1, 2024
Analytics and AI Methods at Scale Group (AAIMS), Oak Ridge National Laboratory, USA
Research Staff, September 2024 ~ Present

AI

Overview

Bringing AI into HPC operations requires translating unstructured machine data into structured insights. This project explores how LLMs can automate data science workflows, reducing the dependency on domain-specific expertise.

  • Built an LLM-based query system that lets operators interact with HPC telemetry using natural language.
  • Designed a self-adaptive AutoML system for operational forecasting, reducing the barrier for predictive analytics in HPC centers.
  • Established an autonomous model registry and retraining pipeline for continuously evolving operational intelligence.

Technology

  • Python frameworks: LangChain, Chainlit, Pydantic, ChromaDB, LiteLLM, Hugging Face
  • Retrieval-augmented generation (RAG) ingest pipeline and context
  • PyTorch-based predictive artificial neural network (ANN)
  • AutoML frameworks (AutoGluon Tabular and TimeSeries)
  • A combination of local LLMs, such as Llama 3.1 and 3.2, and large OpenAI models
  • Context-aware, tool-call-driven hierarchical LLM chain structure
  • DuckDB-based SQL interface to Parquet-based telemetry data
  • MLflow-based experiment and model tracking
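
As a rough illustration of how a tool-call-driven query path over a SQL telemetry interface might fit together, here is a minimal sketch. It is not the project's implementation. It makes several loudly stated assumptions: the standard-library sqlite3 module stands in for DuckDB over Parquet so the example runs without extra dependencies, a simple keyword matcher stands in for the LLM's tool-selection step, and the table schema, sample rows, and tool names (mean_power, peak_power) are all hypothetical. It shows a single routing level only; a hierarchical structure would nest routers of this kind.

```python
import sqlite3
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical telemetry rows: (node, timestamp in seconds, power in watts).
ROWS = [
    ("node01", 0, 410.0),
    ("node01", 60, 430.0),
    ("node02", 0, 380.0),
    ("node02", 60, 400.0),
]


def make_db() -> sqlite3.Connection:
    """Load the sample rows into an in-memory table.

    Assumption: sqlite3 is a stand-in for a DuckDB connection over Parquet files.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE telemetry (node TEXT, ts INTEGER, power_w REAL)")
    con.executemany("INSERT INTO telemetry VALUES (?, ?, ?)", ROWS)
    return con


def mean_power(con: sqlite3.Connection) -> List[Tuple]:
    """Tool: average power per node."""
    return con.execute(
        "SELECT node, AVG(power_w) FROM telemetry GROUP BY node ORDER BY node"
    ).fetchall()


def peak_power(con: sqlite3.Connection) -> List[Tuple]:
    """Tool: maximum power per node."""
    return con.execute(
        "SELECT node, MAX(power_w) FROM telemetry GROUP BY node ORDER BY node"
    ).fetchall()


@dataclass
class Tool:
    name: str
    keywords: Tuple[str, ...]  # crude stand-in for an LLM choosing a tool
    run: Callable[[sqlite3.Connection], List[Tuple]]


TOOLS = [
    Tool("mean_power", ("average", "mean"), mean_power),
    Tool("peak_power", ("peak", "max"), peak_power),
]


def route(question: str) -> Optional[Tool]:
    """Pick the first tool whose keywords appear in the question."""
    q = question.lower()
    for tool in TOOLS:
        if any(k in q for k in tool.keywords):
            return tool
    return None


def answer(question: str, con: sqlite3.Connection) -> Tuple[str, List[Tuple]]:
    """Route a natural-language question to a tool and run its SQL query."""
    tool = route(question)
    if tool is None:
        return "no_tool", []
    return tool.name, tool.run(con)


if __name__ == "__main__":
    con = make_db()
    print(answer("What is the average power per node?", con))
```

Keeping each tool as a fixed, parameterized SQL query (rather than letting the selection step emit arbitrary SQL) is one possible design choice: it limits what a misrouted question can do to the data store, at the cost of flexibility.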

Publications

  • Publication in preparation, targeting SC25 (work in progress)
Woong Shin
Authors
HPC Systems/Software/Data Engineer, Computer Systems Researcher