Power & Energy Monitoring for Frontier (2021 – Present)
Research Staff, July 2021 – Present
Overview
The Frontier supercomputer required scalable power and energy monitoring for exascale operations. This project developed a system capable of handling 100 GiB/day of telemetry data for both real-time and historical analysis.
- Engaged the vendor to develop data streams by curating sensors tailored to the needs of the OLCF.
- Designed and implemented a real-time data ingest and refinement pipeline, producing multi-purpose, reusable, and contextualized job power profile data in near-real time.
- Developed a visual analytics tool for low-latency, interactive, hierarchical drill-down of multi-year, high-dimensional job power profile data.
- Enabled diverse use cases, including ticket handling dashboards, facility dashboards, digital twins, and long-term analysis.
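To illustrate what a contextualized job power profile means, here is a minimal sketch that attributes node-level power samples to the jobs that held those nodes. It is not the production pipeline: it uses batch pandas rather than PySpark structured streaming, and the column names (`timestamp`, `node`, `power_w`, `job_id`, `start`, `end`) and the function `job_power_profile` are hypothetical, chosen only for this example.

```python
import pandas as pd


def job_power_profile(telemetry: pd.DataFrame, allocations: pd.DataFrame) -> pd.DataFrame:
    """Sum node power per job and timestamp (illustrative schema, not the real one)."""
    # Pair every power sample with every allocation on the same node.
    joined = telemetry.merge(allocations, on="node")
    # Keep only samples taken while the job actually held the node.
    in_job = joined[
        (joined["timestamp"] >= joined["start"]) & (joined["timestamp"] < joined["end"])
    ]
    # Aggregate node power into one row per job and timestamp.
    return (
        in_job.groupby(["job_id", "timestamp"], as_index=False)
        .agg(power_w=("power_w", "sum"), nodes=("node", "nunique"))
        .sort_values(["job_id", "timestamp"])
        .reset_index(drop=True)
    )


# Hypothetical telemetry: two nodes sampled every 15 seconds.
telemetry = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2024-01-01 00:00:00",
                "2024-01-01 00:00:00",
                "2024-01-01 00:00:15",
                "2024-01-01 00:00:15",
            ]
        ),
        "node": ["n1", "n2", "n1", "n2"],
        "power_w": [500.0, 700.0, 520.0, 680.0],
    }
)

# Hypothetical scheduler records: job 101 ran on both nodes for 30 seconds.
allocations = pd.DataFrame(
    {
        "job_id": [101, 101],
        "node": ["n1", "n2"],
        "start": pd.to_datetime(["2024-01-01 00:00:00"] * 2),
        "end": pd.to_datetime(["2024-01-01 00:00:30"] * 2),
    }
)

profile = job_power_profile(telemetry, allocations)
print(profile)
```

The merge-then-filter join is the simplest way to show the idea. At the data volumes described above, a streaming join keyed on node and time window would be the more likely choice.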
Technology
- HPC System: Frontier Cray EX (Cray EX telemetry, Slurm JSON dump)
- Facility: BACnet/IP, custom devices, Metasys historian
- Storage: Apache Kafka, Apache Druid cluster, Apache Spark standalone cluster, MinIO S3 object store, PostgreSQL, ZooKeeper, Redis (all deployed on Kubernetes, OpenShift 4)
- Application platform: Bazel monorepo (rules_gitops, rules_k8s, rules_docker) with GitLab + ArgoCD for CI/CD
- Kubernetes tooling: kustomize, podman, docker
- Secrets and config management: Ansible Vault, Ansible
- Data engineering: Python, PySpark Structured Streaming, Parquet, pandas
- Machine learning workflow: PyTorch, SciPy, scikit-learn, MLflow, Airflow
- Applications: Python 3 (FastAPI, pydantic, pandera), JavaScript (ReactJS)
- Misc.: Apache APISIX, Keycloak (OAuth2 + LDAP), JupyterHub
Publications
