GPU Memory Corruption Analysis (2021 - 2023)
Research Staff, July 2021 to March 2023
Overview
As high-performance computing (HPC) systems continue to scale, reliability challenges, particularly GPU memory corruption, become increasingly critical. This project investigates double-bit errors (DBEs), one of the most disruptive yet least understood failure modes in large-scale GPU-accelerated systems. By analyzing operational data from the Summit supercomputer, we uncover patterns and contributing factors behind these rare but costly errors.
- Analyzed GPU memory DBEs across 27,756 V100 GPUs on Summit, studying correlations with power usage, workload types, and GPU placement.
- Identified key factors contributing to DBE occurrences, including power fluctuation dynamics, application behavior, and GPU utilization patterns.
- Evaluated the impact of thermal states, finding minimal correlation between high operating temperatures and DBE rates in production settings.
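Relating DBEs to workload types requires first attributing each error to the job that held the affected GPU at that moment. The snippet below is a minimal sketch of that step, not the project's actual pipeline: the record layout, field names (`gpu_id`, `start`, `end`, `workload`), and all values are hypothetical and chosen only for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical scheduler records (field names and values are assumptions).
jobs = [
    {"job_id": "A", "gpu_id": 0, "workload": "ml_training",
     "start": datetime(2022, 1, 1, 0, 0), "end": datetime(2022, 1, 1, 2, 0)},
    {"job_id": "B", "gpu_id": 1, "workload": "simulation",
     "start": datetime(2022, 1, 1, 1, 0), "end": datetime(2022, 1, 1, 3, 0)},
]

# Hypothetical DBE log entries (made-up timestamps).
dbe_events = [
    {"gpu_id": 0, "time": datetime(2022, 1, 1, 1, 30)},  # falls inside job A
    {"gpu_id": 1, "time": datetime(2022, 1, 1, 5, 0)},   # no job on this GPU
    {"gpu_id": 1, "time": datetime(2022, 1, 1, 2, 15)},  # falls inside job B
]


def attribute_events(events, jobs):
    """Pair each event with the job holding the same GPU at that time, or None."""
    attributed = []
    for event in events:
        match = next(
            (job for job in jobs
             if job["gpu_id"] == event["gpu_id"]
             and job["start"] <= event["time"] < job["end"]),
            None,
        )
        attributed.append((event, match))
    return attributed


attributed = attribute_events(dbe_events, jobs)

# Count errors per workload type; events with no matching job are kept
# under their own label rather than silently dropped.
counts = Counter(
    job["workload"] if job is not None else "unattributed"
    for _, job in attributed
)
print(dict(counts))
```

The linear scan is fine for a toy example; at the scale of a multi-year log, an interval index or a sorted merge keyed on GPU and time would be the more realistic choice.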
Technology
- Data sources: High-resolution (1 Hz) telemetry, job scheduler logs, and GPU error logs from a 2.5-year operational dataset.
- Statistical & ML methods: Exploratory data analysis, t-tests, survival analysis, interpretable ML models.
- Computing platforms: Summit supercomputer, NVIDIA Tesla V100 GPUs with HBM2 memory.
- Key tools: Python (NumPy, Pandas, Scikit-learn), PyTorch, HPC job log analytics, power and thermal monitoring frameworks.
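As an illustration of the t-test step listed above, the following sketch compares a per-job power-variability metric between jobs that experienced a DBE and jobs that did not, using Welch's unequal-variance t-test. It is a sketch under stated assumptions rather than the analysis actually performed: the metric (standard deviation of GPU power per job), the choice of the Welch variant, and every number are made up for illustration. Only the standard library is used so the example runs anywhere.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.fmean(sample_a), statistics.fmean(sample_b)
    # Sample variances (n - 1 denominator), each scaled by its sample size.
    se_a = statistics.variance(sample_a) / n_a
    se_b = statistics.variance(sample_b) / n_b
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical per-job standard deviation of GPU power, in watts.
# These values are invented and do not come from the Summit dataset.
power_std_dbe_jobs = [41.0, 38.5, 44.2, 40.1, 39.7]
power_std_clean_jobs = [30.2, 28.9, 33.1, 29.5, 31.0, 30.8]

t_stat, dof = welch_t(power_std_dbe_jobs, power_std_clean_jobs)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

Turning the statistic into a p-value needs the t distribution's CDF, which the standard library does not provide; `scipy.stats.ttest_ind(a, b, equal_var=False)` computes the same Welch test and returns the p-value directly.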
Publications
