GPU Memory Corruption Analysis (2021 - 2023)

Jun 1, 2021
Analytics and AI Methods at Scale Group (AAIMS), Oak Ridge National Laboratory, USA
Research Staff, July 2021 to March 2023

Overview

As high-performance computing (HPC) systems continue to scale, reliability challenges, particularly GPU memory corruption, become increasingly critical. This project investigates double-bit errors (DBEs), one of the most disruptive yet least understood failure modes in large-scale GPU-accelerated systems. By analyzing operational data from the Summit supercomputer, we uncover patterns and contributing factors behind these rare but costly errors.

  • Analyzed GPU memory DBEs across 27,756 V100 GPUs on Summit, studying correlations with power usage, workload types, and GPU placement.
  • Identified key factors contributing to DBE occurrences, including power fluctuation dynamics, application behavior, and GPU utilization patterns.
  • Evaluated the impact of thermal states, finding minimal correlation between high operating temperatures and DBE rates in production settings.

Technology

  • Data sources: A 2.5-year operational dataset comprising high-resolution (1 Hz) telemetry, job scheduler logs, and GPU error logs.
  • Statistical & ML methods: Exploratory data analysis, t-tests, survival analysis, interpretable ML models.
  • Computing platforms: Summit supercomputer, NVIDIA Tesla V100 GPUs with HBM2 memory.
  • Key tools: Python (NumPy, Pandas, Scikit-learn), PyTorch, HPC job log analytics, power and thermal monitoring frameworks.
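
As an illustration of the survival-analysis method listed above, here is a minimal sketch of a Kaplan-Meier estimator in plain Python. It is not the project's analysis code. The function name `kaplan_meier`, the framing of "days until a GPU's first DBE", and the toy values are all assumptions made for this example, and the data are synthetic rather than taken from Summit.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct event time.

    durations: time each unit was followed (hypothetical: days until a
               GPU's first DBE, or until observation ended)
    observed:  True if the event (a DBE) was seen, False if censored
    """
    events = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)      # every unit leaving the risk set at t
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]         # drop events and censored units alike
    return curve


# Synthetic toy data (not from Summit): six GPUs, three observed DBEs.
durations = [5, 8, 8, 12, 15, 20]
observed = [True, True, False, True, False, False]

curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(f"day {t}: S(t) = {s:.4f}")
```

Censoring matters for this kind of data because most GPUs never log a DBE within the observation window. The estimator keeps those units in the risk set until their follow-up ends instead of discarding them.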

Publications