Optimization of Training Distributed AI Workloads on Multi-node Computing Systems

Description

This project addresses the immense computational resources required to train Large Language Models (LLMs). The central objective is to develop a framework that identifies the parallelism configuration that maximizes training performance. Using the ASTRA-sim simulator and its synthetic LLM workload generator, the research explores the search space to recommend degrees of data, tensor, sequence, and pipeline parallelism. To increase simulation fidelity, the system integrates an optimized network model that accounts for congestion effects, giving AI researchers and hardware engineers insight into bottlenecks and design trade-offs.
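As a rough illustration of what exploring this search space can involve, the sketch below enumerates candidate (data, tensor, sequence, pipeline) degrees for a fixed device count and picks the one with the lowest score. It is a minimal sketch, not the project's framework: the function names are hypothetical, the rule that the four degrees multiply to the device count is a simplifying assumption, and `cost_fn` stands in for whatever estimate a simulator run would provide.

```python
from itertools import product
from typing import Callable

# (data, tensor, sequence, pipeline) parallelism degrees
Config = tuple[int, int, int, int]


def divisors(n: int) -> list[int]:
    """Return all positive divisors of n in ascending order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def enumerate_configs(num_devices: int) -> list[Config]:
    """List every (dp, tp, sp, pp) whose product equals num_devices.

    Assumption: the four degrees multiply to the device count. Real
    systems may couple some dimensions differently.
    """
    configs = []
    divs = divisors(num_devices)
    for dp, tp, sp in product(divs, repeat=3):
        used = dp * tp * sp
        if num_devices % used == 0:
            # The pipeline degree takes whatever factor is left over.
            configs.append((dp, tp, sp, num_devices // used))
    return configs


def recommend(num_devices: int, cost_fn: Callable[[Config], float]) -> Config:
    """Return the configuration with the lowest cost under cost_fn.

    cost_fn is a hypothetical stand-in for a simulated step time.
    """
    return min(enumerate_configs(num_devices), key=cost_fn)
```

In practice the cost of each candidate would come from a simulation of the workload on the target system rather than a closed-form function, and an exhaustive sweep like this one would likely be replaced by a pruned or guided search as the device count grows.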

Background

Mohammad obtained his Master of Science in Computer Science and Engineering from the Indian Institute of Technology Roorkee (2021–2024), sponsored by the ICCR scholarship, where he specialized in computer architecture with a thesis on the development of a dynamic warp scheduler for GPGPUs. Before that, he earned a degree in Computer Engineering from Tishreen University (Syria), where he received the Al Bassel certificate for academic excellence and developed a machine translation application. Professionally, he has worked as a software engineer and web developer for companies in Syria and Dubai.

Motivation

His research is driven by the hardware limits on compute, memory, and interconnect bandwidth that hinder the development of more powerful LLMs. He is motivated by the challenge of improving infrastructure utilization and aligning model architectures with the underlying hardware. By designing this parallelism recommendation framework, Mohammad seeks to make artificial intelligence training more resource-efficient, cost-effective, and accessible to the scientific community.

Predoctoral Researcher

Mohammad Nasser

Degree in Computer Engineering and Master in Computer Architecture, currently a PhD candidate

Host Organization

Supervisors

Sergi Abadal Cavalle

UPC Supervisor

The content of this website reflects only the views of the Catedra Chip Chair UPC project.

Optimizing distributed AI workload training on multinode computing systems

Sergi Tomàs Martínez

Research Support Investigator

Simulation-Based Recommendation Framework for Scalable Training of Distributed AI

Tomàs Gadea Alcaide

Research Support Investigator

Hierarchical floorplanning optimization algorithms for System-on-Chip architectures

Bernat Ibañez

Research Support Investigator

High Predictability Global Routing during Floorplanning of Complex Chips

Antoni Pech Alberich

Research Support Investigator

Mathematical Optimization Techniques for Hierarchical Floorplanning of Complex Chips

Yilihamujiang Yimamu

Predoctoral Researcher

Optimization of Training Distributed AI Workloads on Multi-node Computing Systems

Xavier Querol Bassols

Research Support Investigator

Development of an Interactive Graphical Tool for Optimization and Editing of Floorplanning in Chip Design

Nuria Elizondo Cereza

Research Support Investigator

Development of Mathematical and Heuristic Optimization Tools for Chip Floorplanning

Guillem Pastor Rué

Research Support Investigator
