Optimization of Training Distributed AI Workloads on Multi-node Computing Systems

Description

This project addresses the immense computational demands of training Large Language Models (LLMs). The central objective is to develop a framework that identifies the parallelism configuration that maximizes training performance. Using the AstraSim tool and its synthetic LLM generator, the research explores the configuration search space to recommend optimal degrees of data, tensor, sequence, and pipeline parallelism. To increase simulation fidelity, the system integrates an optimized network model that accounts for congestion effects, providing critical insight into bottlenecks and design trade-offs for AI researchers and hardware engineers.
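The search described above can be illustrated with a minimal sketch. This is not the project's actual framework: the `toy_cost` function below is a hypothetical placeholder where, in the real system, an AstraSim simulation would supply the performance estimate for each configuration.

```python
from itertools import product

def candidate_configs(num_gpus: int, max_degree: int = 64):
    """Enumerate (dp, tp, pp) parallelism degrees whose product uses all GPUs.

    Sequence parallelism is omitted here because it typically reuses the
    tensor-parallel group rather than consuming additional devices.
    """
    degrees = range(1, max_degree + 1)
    return [
        (dp, tp, pp)
        for dp, tp, pp in product(degrees, repeat=3)
        if dp * tp * pp == num_gpus
    ]

def toy_cost(config):
    """Hypothetical cost model: penalize wide tensor parallelism and deep
    pipelines, which stress interconnect bandwidth and add bubble overhead."""
    dp, tp, pp = config
    return dp * 1.0 + tp * 2.0 + pp * 1.5

# Pick the cheapest configuration for an 8-GPU node under the toy cost.
best = min(candidate_configs(8), key=toy_cost)
```

A real recommendation framework would replace `toy_cost` with simulated step time for a given model, batch size, and network topology, but the enumerate-and-rank structure is the same.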

Background

Mohammad obtained his Master of Science in Computer Science and Engineering from the Indian Institute of Technology Roorkee (2021–2024), sponsored by the ICCR scholarship, where he specialized in computer architecture with a thesis on the development of a dynamic warp scheduler for GPGPUs. Previously, he graduated in Computer Engineering from Tishreen University (Syria), where he received the Al Bassel certificate for academic excellence and developed a machine translation application. Professionally, he has worked as a software engineer and web developer for companies located in Syria and Dubai.

Motivation

His research is driven by hardware limitations in terms of compute, memory, and interconnect bandwidth that hinder the development of more powerful LLMs. He is motivated by the challenge of improving infrastructure utilization and aligning model architectures with the underlying hardware. By designing this parallelism recommendation framework, Mohammad seeks to make artificial intelligence training more resource-efficient, cost-effective, and accessible to the scientific community.
Mohammad Nasser

Degree in Computer Engineering and Master's in Computer Architecture; currently a PhD candidate

Host Organization

Supervisors

Sergi Abadal Cavalle

UPC Supervisor

The content of this website reflects only the views of the Càtedra Chip UPC project.
