Cooling Intelligence for Summit (2018 – 2024)
Research Associate, May 2018 – November 2024
Overview
Developed and maintained a near real-time monitoring and analytics system to optimize cooling efficiency and reduce energy consumption for the Summit supercomputer.
- Integrated facility and system telemetry to provide near real-time visibility into Summit’s cooling and power systems.
- Enabled field engineers to make data-driven decisions that reduced overcooling, yielding significant cooling energy savings.
- Supported continuous operations and maintenance, ensuring data quality and system reliability over Summit’s lifetime.
Technology
- IBM OpenBMC telemetry streaming
- IBM LSF and IBM CSM
- Python 3 for data collection daemons, web-to-Kafka conversion, compression, and data processing
- High-availability (HA) configuration for Kafka, ZooKeeper, etcd, and Prometheus on Kubernetes (OpenShift 3)
- Grafana
Publications
- Thaler et al., “Hybrid Approach to HPC Cluster Telemetry and Hardware Log Analytics”, HPEC'20
- Ott et al., “Global Experiences with HPC Operational Data Measurement, Collection and Analysis”, EE HPC WG State of the Practice Workshop @ CLUSTER'20
Coverage
- OLCF, “OLCF and Tech Company Providentia Worldwide Build Intelligence System for Supercomputer Cooling Plant”, 2019
- HPCWire, “OLCF and Providentia Worldwide Build System for Supercomputer Cooling Plant”, 2019
- insideHPC, “Providentia Worldwide Builds Intelligence into Summit Supercomputer Cooling”, 2019
