Long-Term Analysis of Summit’s Power & Thermal Dynamics (2020 – 2021)

Oct 1, 2020
Analytics and AI Methods at Scale Group (AAIMS), Oak Ridge National Laboratory, USA
Research Staff, October 2020 – June 2021

Overview

This project analyzed a year of operational data from the Summit supercomputer to characterize its power, energy, and thermal behavior, building on insights from the Cooling Intelligence for Summit initiative. The findings provide a foundation for ML-based methods for improving energy efficiency.

  • Conducted a comprehensive study on power consumption at component, node, and system levels across all 4,626 Summit compute nodes.
  • Analyzed over 840,000 Summit jobs and 250,000 GPU failure logs, uncovering operational insights.
  • Processed high-frequency (1 Hz) telemetry data spanning the entire year of 2020.
  • Led and authored a study that won a best paper award at SC21, the premier conference in HPC.
  • Research team received the UT-Battelle Research Accomplishment Award (2022) for contributions to HPC energy efficiency.

Technology

  • OLCF Summit Supercomputer Telemetry (2021)
  • OLCF Andes cluster
  • Dask, parquet, pandas
  • Jupyter, seaborn
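
As a rough illustration of how tools like these fit together, the sketch below rolls 1 Hz readings up from component level to node level and then to system level with pandas. It is a minimal example under stated assumptions, not the project's actual pipeline: the column names (`timestamp`, `node`, `cpu_power_w`, `gpu_power_w`), the helper functions, and the synthetic values are all hypothetical. At the scale described above, the same aggregation would likely have to be distributed (for example with Dask over parquet files), which is not shown here.

```python
import pandas as pd


def make_synthetic_telemetry(n_nodes=3, seconds=60):
    """Build a tiny, made-up 1 Hz table with one row per (timestamp, node)."""
    timestamps = pd.date_range(
        "2020-01-01", periods=seconds, freq=pd.Timedelta(seconds=1)
    )
    frames = []
    for i in range(n_nodes):
        frames.append(
            pd.DataFrame(
                {
                    "timestamp": timestamps,
                    "node": f"node{i:04d}",          # hypothetical node name
                    "cpu_power_w": 150.0 + 10 * i,   # hypothetical constant watts
                    "gpu_power_w": 900.0 + 50 * i,   # hypothetical constant watts
                }
            )
        )
    return pd.concat(frames, ignore_index=True)


def aggregate_power(df, window="10s"):
    """Return (node_level, system_level) power averaged over fixed time windows."""
    df = df.copy()
    # Component level -> node level: add the per-component readings.
    df["node_power_w"] = df["cpu_power_w"] + df["gpu_power_w"]

    # Downsample each node's 1 Hz series to the mean power per window.
    node_level = (
        df.groupby(["node", pd.Grouper(key="timestamp", freq=window)])["node_power_w"]
        .mean()
        .reset_index()
    )

    # Node level -> system level: sum all nodes within each window.
    system_level = (
        node_level.groupby("timestamp")["node_power_w"].sum().rename("system_power_w")
    )
    return node_level, system_level


if __name__ == "__main__":
    node_level, system_level = aggregate_power(make_synthetic_telemetry())
    print(node_level.head())
    print(system_level.head())
```

Averaging within each node before summing across nodes keeps the intermediate table small (one row per node per window), which is the kind of reduction that makes year-long 1 Hz data tractable; the 10-second window is an arbitrary choice for the example.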

Publication