Cooling Intelligence for Summit (2018 – 2024)

Jun 1, 2018 · 1 min read
Technology Integration Group, Oak Ridge National Laboratory, USA
Research Associate, May 2018 to November 2024

Overview

Developed and maintained a near real-time monitoring and analytics system to optimize cooling efficiency and reduce energy consumption for the Summit supercomputer.

  • Integrated facility and system telemetry to provide real-time visibility into Summit’s cooling and power systems.
  • Enabled data-driven decisions by field engineers, who achieved significant cooling energy savings by identifying and correcting overcooling.
  • Supported continuous operations and maintenance, ensuring data quality and system reliability over Summit’s lifetime.

Technology

  • IBM OpenBMC Telemetry streaming
  • IBM LSF and IBM CSM
  • Python 3 for data collection daemons, web-to-Kafka conversion, compression, and data processing
  • HA configuration for Kafka, ZooKeeper, etcd, and Prometheus on Kubernetes (OpenShift 3)
  • Grafana
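
As a rough illustration of the compression step named in the list above, the following is a minimal sketch, not the project's actual code. It assumes telemetry samples arrive as small dictionaries and are batched, serialized to JSON, and compressed before being handed to a message producer. The field names (node, metric, ts, value), the sample values, and the helper functions are all hypothetical, and the Kafka producer itself is left out so the example runs with the standard library alone.

```python
import json
import zlib


def encode_batch(samples):
    """Serialize a list of telemetry samples to compact JSON, then compress it."""
    payload = json.dumps(samples, separators=(",", ":"), sort_keys=True)
    return zlib.compress(payload.encode("utf-8"), 6)


def decode_batch(blob):
    """Reverse encode_batch: decompress and parse back into a list of samples."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


def make_samples(count):
    """Build hypothetical samples; real field names and values would differ."""
    return [
        {
            "node": "node0001",        # hypothetical node identifier
            "metric": "inlet_temp_c",  # hypothetical metric name
            "ts": 1528000000 + i,      # one sample per second
            "value": 21.5,
        }
        for i in range(count)
    ]


if __name__ == "__main__":
    samples = make_samples(100)
    blob = encode_batch(samples)
    raw_size = len(json.dumps(samples).encode("utf-8"))
    print("raw bytes:", raw_size, "compressed bytes:", len(blob))
    assert decode_batch(blob) == samples
```

Batching before compression matters in a sketch like this because telemetry records repeat the same keys and node names, so a batch compresses far better than individual messages would.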

Publications

Coverage