Power & Energy Monitoring for Frontier (2021 – Present)

Jun 1, 2021 · 2 min read
Analytics and AI Methods at Scale Group (AAIMS), Oak Ridge National Laboratory, USA
Research Staff, July 2021 – Present

LVA - System View - Frontier

Overview

The Frontier supercomputer required scalable energy monitoring for exascale operations. This project developed a system capable of handling 100 GiB/day of telemetry data for both real-time and historical analysis.

  • Engaged the vendor to develop data streams by curating sensors tailored to the needs of the OLCF.

  • Designed and implemented a real-time data ingest and refinement pipeline, producing multi-purpose, reusable, and contextualized job power profile data in near-real time.

  • Developed a visual analytics tool for low-latency, interactive, hierarchical drill-down of multi-year, high-dimensional job power profile data.

  • Enabled diverse use cases, including ticket handling dashboards, facility dashboards, digital twins, and long-term analysis.
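
As a rough illustration of what "contextualized job power profile data" can mean, the sketch below attributes per-node power samples to the job that held each node at that time, then sums per-node averages into one time series per job. It is a minimal, in-memory sketch under stated assumptions, not the project's pipeline: the `Job` record, the `(timestamp, node, watts)` sample layout, the `job_power_profiles` function, and the 15-second bucket are all hypothetical names and values chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical scheduler record: which nodes a job held, and when."""
    job_id: str
    nodes: frozenset
    start: int  # epoch seconds, inclusive
    end: int    # epoch seconds, exclusive


def job_power_profiles(samples, jobs, bucket_s=15):
    """Build {job_id: {bucket_start: total_watts}} from raw node samples.

    samples is an iterable of (timestamp, node, watts) tuples. The tuple
    layout and the 15 s default bucket are assumptions for illustration.
    """
    # Step 1: contextualize. Keep only samples that fall on a node and in a
    # time window owned by some job, grouped by (job, time bucket, node).
    per_node = defaultdict(list)
    for ts, node, watts in samples:
        for job in jobs:
            if node in job.nodes and job.start <= ts < job.end:
                bucket = ts - (ts % bucket_s)
                per_node[(job.job_id, bucket, node)].append(watts)

    # Step 2: aggregate. Average each node within a bucket, then sum the
    # node averages to get the job's total power for that bucket.
    profiles = defaultdict(lambda: defaultdict(float))
    for (job_id, bucket, _node), values in per_node.items():
        profiles[job_id][bucket] += sum(values) / len(values)

    # Return plain dicts with buckets in time order.
    return {jid: dict(sorted(b.items())) for jid, b in profiles.items()}


if __name__ == "__main__":
    jobs = [Job("1001", frozenset({"n1", "n2"}), start=0, end=30)]
    samples = [(0, "n1", 100.0), (5, "n1", 200.0), (0, "n2", 300.0)]
    print(job_power_profiles(samples, jobs))
```

The nested loop over jobs is quadratic and only suitable for a toy example. A streaming deployment would more plausibly express the same join and aggregation in an engine such as the PySpark structured streaming listed under Technology.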

Obsidian - ODA Platform

Technology

  • HPC System: Frontier Cray EX - Cray EX telemetry, Slurm JSON dump
  • Facility: BACnet/IP, custom devices, Metasys historian
  • Storage: Apache Kafka, Apache Druid cluster, Apache Spark standalone cluster, MinIO S3 object store, PostgreSQL, ZooKeeper, Redis (all deployed on Kubernetes, OpenShift 4)
  • Application platform: Bazel monorepo (rules_gitops, rules_k8s, rules_docker) with GitLab + ArgoCD for CI/CD
  • Kubernetes tooling: kustomize, podman, docker
  • Secrets and config management: Ansible Vault, Ansible
  • Data engineering: Python PySpark Structured Streaming, Parquet, pandas
  • Machine Learning workflow: PyTorch, SciPy, scikit-learn, MLflow, Airflow
  • Applications: Python 3 - FastAPI, pydantic, pandera; JavaScript - ReactJS
  • Misc.: Apache APISIX, Keycloak (OAuth2 + LDAP), JupyterHub

Publications