
Compute Through the Chaos: Asynchrony and Fault Tolerance in Iterative Methods

Presenter
May 4, 2026
Abstract
Heterogeneous HPC platforms and emerging distributed computing settings, including power-grid networks, autonomous drones, and edge AI, increasingly operate under limited bandwidth, variable latency, and imperfect reliability. In such environments, global synchronization is expensive, and classical bulk-synchronous numerical methods can be fragile in the presence of delays and faults. These challenges are amplified by modern hardware trends, where low-precision arithmetic and silent data corruption can introduce persistent perturbations that standard algorithms were not designed to absorb. This talk presents a mathematical view of asynchronous and reduced-synchronization iterative methods for core tasks in scientific computing, including linear systems, eigenvalue problems, and optimization. I will describe our work at LLNL on designing algorithms that remove or relax synchronization points while incorporating fault-tolerant modifications that help detect when corrupted or noisy updates (for example, additive perturbations and sporadically corrupted components) begin to compromise convergence. Rather than relying solely on system-level remedies such as checkpointing, these approaches embed lightweight resilience mechanisms directly into the iteration, enabling progress under unreliable and heterogeneous execution. The goal is to connect provable guarantees with implementable methods for next-generation, fault-prone computing.

LLNL-ABS-2017288
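To make the idea of an in-iteration resilience check concrete, here is a minimal sketch. It is not the speaker's or LLNL's algorithm; it is a textbook-style asynchronous Jacobi iteration with one simple safeguard chosen for illustration. Assumptions: the matrix is strictly diagonally dominant, asynchrony is simulated by letting each update read a snapshot up to `max_delay` steps old, and faults are simulated as large additive perturbations. The function name `async_jacobi` and all of its parameters are hypothetical. The safeguard relies on a standard fact: starting from zero, every fault-free Jacobi update keeps each component within `max|b_i| / delta` in absolute value, where `delta` is the smallest diagonal-dominance margin, so any update outside that bound can be discarded without ever rejecting a correct one.

```python
import random
from collections import deque


def async_jacobi(A, b, steps=2000, max_delay=5, fault_prob=0.05, seed=0):
    """Simulated asynchronous Jacobi with a bound-based update filter.

    Illustrative only: asynchrony and faults are simulated in one thread.
    Returns the final iterate and the number of rejected updates.
    """
    rng = random.Random(seed)
    n = len(b)

    # Smallest margin of strict diagonal dominance over all rows.
    delta = min(
        abs(A[i][i]) - sum(abs(A[i][j]) for j in range(n) if j != i)
        for i in range(n)
    )
    if delta <= 0:
        raise ValueError("A must be strictly diagonally dominant")

    # Fault-free updates started from zero never leave this box.
    bound = max(abs(v) for v in b) / delta

    x = [0.0] * n
    # Recent snapshots of x, used to model stale (delayed) reads.
    history = deque([list(x)], maxlen=max_delay + 1)
    rejected = 0

    for _ in range(steps):
        i = rng.randrange(n)  # component updated at this step
        stale = history[rng.randrange(len(history))]  # possibly outdated view

        # Jacobi update of component i computed from the stale view.
        off_diag = sum(A[i][j] * stale[j] for j in range(n) if j != i)
        new = (b[i] - off_diag) / A[i][i]

        # Simulated fault: a large additive perturbation of the update.
        if rng.random() < fault_prob:
            new += rng.choice((-1.0, 1.0)) * 1e3 * bound

        # Lightweight check: discard updates outside the invariant box.
        if abs(new) > bound:
            rejected += 1
        else:
            x[i] = new

        history.append(list(x))

    return x, rejected


A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b = [2.0, 4.0, 10.0]
x, rejected = async_jacobi(A, b)
print(x, rejected)
```

The exact solution of this small system is (1, 2, 3), and the iteration approaches it despite stale reads and injected faults. The filter only catches perturbations large enough to leave the box; smaller corruptions pass through and are damped by later updates only if they stop recurring, which is the kind of gap that more careful detection mechanisms aim to close.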