AI for HPC Operations (2024 - Present)
Research Staff, September 2024 - Present
Overview
Bringing AI into HPC operations requires translating unstructured machine data into structured insights. This project explores how LLMs can automate data science workflows, reducing the dependency on domain-specific expertise.
- Built an LLM-based query system that lets operators interact with HPC telemetry using natural language.
- Designed a self-adaptive AutoML system for operational forecasting, lowering the barrier to predictive analytics in HPC centers.
- Established an autonomous model registry and retraining pipeline for continuously evolving operational intelligence.
Technology
- Python frameworks: LangChain, Chainlit, Pydantic, ChromaDB, LiteLLM, Hugging Face
- Retrieval-Augmented Generation (RAG) ingest pipeline and context retrieval
- PyTorch-based predictive artificial neural network (ANN)
- AutoML frameworks (AutoGluon Tabular and TimeSeries)
- A combination of local LLMs, such as Llama 3.1 and 3.2, and large OpenAI models
- Context-aware, tool-call-driven hierarchical LLM chain structure
- DuckDB-based SQL interface to Parquet-based telemetry data
- MLflow-based experiment and model tracking
Publications
- Publication in preparation, targeting SC25 (work in progress)
