Elastic deep learning through resilient collective operations

Title: Elastic deep learning through resilient collective operations
Publication Type: Conference Paper
Year of Publication: 2023
Authors: Li, J., G. Bosilca, A. Bouteiller, and B. Nicolae
Conference Name: SC-W 2023: Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis
Date Published: 2023-11
Publisher: ACM
Conference Location: Denver, CO
ISBN Number: 9798400707858
Abstract

This paper presents a robust solution that incorporates fault tolerance and elastic scaling capabilities for distributed deep learning. Taking advantage of MPI's resilience capabilities, also known as User-Level Failure Mitigation (ULFM), this novel approach provides efficient and lightweight failure management and supports smooth scaling in volatile computational settings. The proposed ULFM MPI-centered mechanism outperforms the only officially supported elastic learning framework, Elastic Horovod (using Gloo and NCCL), by a significant factor. These results reinforce the capability of the MPI extension to deal with resiliency, and promote ULFM as an effective technique for fault management that minimizes downtime and thereby enhances the overall performance of distributed applications, in particular elastic training in high-performance computing (HPC) environments and machine learning applications.

URL: https://dl.acm.org/doi/abs/10.1145/3624062.3626080
DOI: 10.1145/3624062.3626080