Operational Data Analytics

Modern HPC facilities generate vast, complex streams of operational data, ranging from hardware performance metrics and job logs to environmental sensor readings. My research and engineering efforts in operational data analytics aim to transform these data torrents into actionable insights. I design and deploy scalable data pipelines capable of ingesting hundreds of gigabytes per day, integrating technologies such as Apache Kafka, Spark, and distributed databases to process and curate real-time and historical telemetry.

Building on these robust data foundations, I develop advanced analytics and visualization tools that help HPC operators detect anomalies, predict failures, and optimize resource usage. Whether it’s diagnosing GPU memory corruption at scale or pinpointing overcooling in data centers, the overarching goal is to enhance system reliability, efficiency, and user productivity. By bridging data engineering with domain knowledge, I help HPC sites adopt a proactive, data-centric approach to managing next-generation supercomputers.
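As a rough illustration of the kind of anomaly detection described above, the sketch below flags telemetry readings that deviate sharply from a trailing window. It is a minimal, generic rolling z-score example and not the tooling used in production; the function name `rolling_zscore_anomalies`, the window size, the threshold, and the sample temperature values are all hypothetical.

```python
from collections import deque
from statistics import mean, stdev


def rolling_zscore_anomalies(readings, window=5, threshold=3.0):
    """Return indices of readings that differ from the mean of the
    preceding `window` readings by more than `threshold` standard deviations."""
    history = deque(maxlen=window)  # trailing window of recent readings
    anomalies = []
    for i, value in enumerate(readings):
        # Only score a reading once a full window of history is available.
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # Skip scoring when the window is perfectly flat (sigma == 0).
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                anomalies.append(i)
        history.append(value)
    return anomalies


# Hypothetical GPU temperature samples (degrees Celsius) with one spike.
temps = [61.0, 61.5, 60.8, 61.2, 61.1, 61.3, 78.4, 61.0, 61.2]
print(rolling_zscore_anomalies(temps))
```

A trailing window keeps the check cheap enough to run on streaming data, though a real deployment would likely need per-sensor baselines and more robust statistics than a plain mean and standard deviation.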

Related Projects
Related Talks and Events