Title: Architecting for Resilience at Scale
Speaker: Sudhanva Gurumurthi, Principal Member of the Technical Staff | AMD RAS Architecture
Date: Friday, February 5, 2021
Time: 1:00 - 2:00 pm
Location: Zoom Meeting
Abstract: Reliability is a fundamental abstraction that underlies computer architecture and systems design. While technology scaling has made it possible to build ever more powerful computers, it has also made this abstraction harder to maintain, especially at scale in a data center. This talk will first explain this problem and motivate the need for reliable and resilient hardware design. The talk will then present three areas of research and advanced development by AMD on this topic. First, the talk will present data and insights from production systems in data centers, collected over a multi-year timescale, to demonstrate the nature of faults observed in the field. This data spans several generations of DRAM, as well as SRAM. Second, the talk will discuss tools and techniques to model the potential impact of faults during the early stages of processor design and to evaluate resilience at the software level. Finally, the talk will present a low-cost approach to providing resilience to GPUs through compiler-managed redundant multi-threading and discuss our experience from prototyping this technique.
Bio: Sudhanva Gurumurthi is a Principal Member of the Technical Staff at AMD, where he leads advanced development in Reliability, Availability, and Serviceability (RAS). His responsibilities include leading the pathfinding of new RAS features and their technology transfer into the company roadmap, leading the RAS definition of new technologies in industry consortiums, and providing guidance on RAS for emerging technologies and use cases being explored by R&D groups in AMD. He was previously a tenured Associate Professor in the Computer Science Department at the University of Virginia. Sudhanva is a recipient of the NSF CAREER Award, a Google Focused Research Award, two Google Faculty Research Awards, and several other NSF and industry awards. He received his PhD in Computer Science and Engineering from Penn State in 2005. He is a Senior Member of the IEEE and the ACM.