GPU Memory Corruption Analysis (2021 - 2023)

Jun 1, 2021

Analytics and AI Methods at Scale Group (AAIMS), Oak Ridge National Laboratory, USA
Research Staff, July 2021 ~ March 2023

Overview

As high-performance computing (HPC) systems continue to scale, reliability challenges—particularly GPU memory corruption—become increasingly critical. This project investigates double-bit errors (DBEs), one of the most disruptive yet least understood failure modes in large-scale GPU-accelerated systems. By analyzing operational data from the Summit supercomputer, we uncover patterns and contributing factors behind these rare but costly errors.

Analyzed GPU memory DBEs across 27,756 V100 GPUs on Summit, studying correlations with power usage, workload types, and GPU placement.
Identified key factors contributing to DBE occurrences, including power fluctuation dynamics, application behavior, and GPU utilization patterns.
Evaluated the impact of thermal states, finding minimal correlation between high operating temperatures and DBE rates in production settings.
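As a rough illustration of how a group comparison like the ones above could be set up, the sketch below computes Welch's t statistic for a per-GPU power-fluctuation metric in GPUs that logged a DBE versus GPUs that did not. The metric (standard deviation of power draw in watts), the function name welch_t, and all numbers are hypothetical and synthetic; they are not values from the Summit dataset, and this is not necessarily the exact test configuration used in the project.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom).

    Welch's form of the two-sample t-test does not assume that the
    two groups have equal variance.
    """
    na, nb = len(a), len(b)
    # Squared standard error of each group's mean (sample variance / n).
    se2_a = variance(a) / na
    se2_b = variance(b) / nb
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df


# Synthetic, illustrative values only: per-GPU standard deviation of
# power draw in watts (hypothetical metric, not Summit data).
power_std_dbe = [41.0, 38.5, 44.2, 40.1, 39.7]
power_std_no_dbe = [30.2, 33.1, 29.8, 31.5, 32.0, 30.9]

t_stat, dof = welch_t(power_std_dbe, power_std_no_dbe)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

A large positive t here would only indicate that the synthetic DBE group has higher mean power variability; turning the statistic into a p-value requires the t distribution, which a statistics library would normally supply.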

Technology

Data sources: High-resolution (1 Hz) telemetry, job scheduler logs, and GPU error logs from a 2.5-year operational dataset.
Statistical & ML methods: Exploratory data analysis, t-tests, survival analysis, interpretable ML models.
Computing platforms: Summit supercomputer, NVIDIA Tesla V100 GPUs with HBM2 memory.
Key tools: Python (NumPy, Pandas, Scikit-learn), PyTorch, HPC job log analytics, power and thermal monitoring frameworks.
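To make the survival-analysis item above more concrete, here is a minimal sketch of a Kaplan-Meier estimator applied to time-to-first-DBE. It assumes each GPU contributes one duration (days in service until its first DBE, or until the observation window ended) and a flag saying whether a DBE was actually observed. The function name kaplan_meier, the variable names, and the data are hypothetical and synthetic; the project may have used a different estimator or library.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(time, survival probability)] at each distinct event time.

    durations: time until the event or until censoring, one per GPU.
    observed:  True if the event (a DBE) occurred, False if censored.
    """
    events = Counter(t for t, o in zip(durations, observed) if o)
    exits = Counter(durations)  # events and censorings both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk GPUs that survive time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Synthetic, illustrative data only: days until first DBE.
# False means the GPU had no DBE by the end of observation (censored).
days = [120, 300, 300, 450, 900, 900, 900]
had_dbe = [True, True, False, True, False, False, False]

curve = kaplan_meier(days, had_dbe)
for t, s in curve:
    print(f"day {t}: estimated fraction without a DBE = {s:.3f}")
```

Treating GPUs that never logged a DBE as censored rather than dropping them matters here, because DBEs are rare and most GPUs in a dataset of this kind would be expected to fall in the censored group.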

Publications

Last updated on Jun 1, 2021