Long-Term Analysis of Summit’s Power & Thermal Dynamics (2020 – 2021)
Research Staff, October 2020 – June 2021
Overview
This project analyzed extensive operational data to characterize Summit’s power, energy, and thermal behavior, building on insights from the Cooling Intelligence for Summit initiative. The findings laid a foundation for ML-based approaches to improving energy efficiency.
- Conducted a comprehensive study on power consumption at component, node, and system levels across all 4,626 Summit compute nodes.
- Analyzed over 840,000 Summit jobs and 250,000 GPU failure logs, uncovering operational insights.
- Processed high-frequency (1 Hz) telemetry data spanning all of 2020.
- Led and authored a study that won the Best Paper Award at SC21, the premier conference in HPC.
- Research team received the UT-Battelle Research Accomplishment Award (2022) for contributions to HPC energy efficiency.
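As a rough illustration of the kind of processing described above, the following is a minimal sketch of downsampling 1 Hz per-node power samples into fixed windows and summing them to a system-level series with pandas. It is not the project's actual pipeline: the column names (`timestamp`, `node`, `input_power_w`), the 10-second window, and the synthetic data are all assumptions made for the example, and scaling this across parquet files with Dask is not shown.

```python
import numpy as np
import pandas as pd


def make_synthetic_telemetry(n_nodes=3, seconds=60, seed=0):
    """Build fake 1 Hz power samples (hypothetical schema, not Summit's)."""
    rng = np.random.default_rng(seed)
    timestamps = pd.date_range("2020-01-01", periods=seconds, freq="s")
    frames = [
        pd.DataFrame(
            {
                "timestamp": timestamps,
                "node": f"node{i:04d}",  # hypothetical node naming
                "input_power_w": rng.normal(1500.0, 50.0, size=seconds),
            }
        )
        for i in range(n_nodes)
    ]
    return pd.concat(frames, ignore_index=True)


def node_power_windows(df, window="10s"):
    """Average each node's 1 Hz samples over fixed time windows."""
    return (
        df.set_index("timestamp")
        .groupby("node")["input_power_w"]
        .resample(window)
        .mean()
        .rename("mean_power_w")
        .reset_index()
    )


def system_power(node_windows):
    """Sum the per-node window means into one system-level series."""
    return (
        node_windows.groupby("timestamp")["mean_power_w"]
        .sum()
        .rename("system_power_w")
    )


telemetry = make_synthetic_telemetry()
per_node = node_power_windows(telemetry)
per_system = system_power(per_node)
```

Here `per_node` holds one row per node per window and `per_system` holds one value per window, which mirrors the node-level and system-level views mentioned in the bullets.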
Technology
- OLCF Summit Supercomputer Telemetry (2021)
- OLCF Andes cluster
- Dask, parquet, pandas
- Jupyter, seaborn
Publication
