Optimizing distributed AI workload training on multinode computing systems

Description

Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide variety of tasks, making them a central focus of current research in artificial intelligence. However, running these models efficiently requires substantial computational resources and long execution times. One of the main challenges is determining the parallelism strategy that maximizes performance. This project aims to develop a framework that identifies the most appropriate parallelism configuration for a given LLM. Using AstraSim and its synthetic LLM generator, the framework systematically explores the search space to recommend the best degrees of data, tensor, sequence, and pipeline parallelism. To further increase simulation accuracy, the framework integrates an optimized network model that accounts for congestion and other limitations inherent to communication in distributed environments. The results are valuable to two professional profiles. Artificial intelligence researchers can use them to obtain recommendations on the most efficient parallelism strategies for training LLMs, while hardware engineers can gain insight into design trade-offs and identify potential bottlenecks associated with different architectural choices.
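As a rough illustration of the search space mentioned above, the sketch below enumerates every way of splitting a fixed number of devices across data, tensor, sequence, and pipeline parallelism, and picks the configuration with the lowest cost. It is not the project's framework and does not call AstraSim: the function names (`enumerate_configs`, `recommend`, `toy_cost`) and the cost weights are hypothetical, and in a real setup the cost function would presumably be replaced by a simulator run.

```python
from itertools import product
from typing import Callable

# A configuration is (data, tensor, sequence, pipeline) parallelism degrees.
Config = tuple[int, int, int, int]


def divisors(n: int) -> list[int]:
    """Return all positive divisors of n in ascending order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def enumerate_configs(num_devices: int) -> list[Config]:
    """List every (dp, tp, sp, pp) whose product equals num_devices."""
    divs = divisors(num_devices)
    return [
        (dp, tp, sp, pp)
        for dp, tp, sp, pp in product(divs, repeat=4)
        if dp * tp * sp * pp == num_devices
    ]


def toy_cost(config: Config) -> float:
    """Hypothetical stand-in for a simulated step time.

    The weights are arbitrary and only make the example runnable;
    they do not model any real system or simulator.
    """
    dp, tp, sp, pp = config
    return 0.01 * (dp - 1) + 0.05 * (tp - 1) + 0.03 * (sp - 1) + 0.02 * (pp - 1)


def recommend(num_devices: int, cost_fn: Callable[[Config], float]) -> Config:
    """Exhaustively search the configurations and return the cheapest one."""
    return min(enumerate_configs(num_devices), key=cost_fn)
```

For example, `recommend(8, toy_cost)` evaluates the 20 ways of splitting 8 devices across the four dimensions. Exhaustive enumeration is only practical for small device counts; larger systems would likely need pruning or a guided search.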

Background

Sergi holds a B.Sc. in Artificial Intelligence from the Polytechnic University of Catalonia (UPC). During his undergraduate studies, he completed a six-month international exchange program at KU Leuven (Belgium), where he joined the Advanced Master in Artificial Intelligence, with a focus on Engineering and Computer Science. His professional career combines industrial application with cutting-edge research. At Aquiles Solutions, he applied machine learning methodologies to complex industrial problems, focusing on decision-making and data-driven optimization. At the Barcelona Supercomputing Center, he worked as a researcher in artificial intelligence security and governance, developing and analyzing evaluation metrics. His research there aimed to improve the transparency, security, and governance frameworks of large-scale AI models and datasets. Finally, at N3Cat (UPC), he conducted research for his bachelor's thesis on the optimization of quantum computing compilers. In that project, he developed new approaches to qubit allocation using Graph Neural Networks (GNNs) and Reinforcement Learning (RL), with the aim of improving hardware efficiency.

Motivation

Sergi constantly challenges himself to step out of his comfort zone to grow both personally and professionally. This project offers him the opportunity to “open the hood” of AI systems, allowing him to bridge the gap between high-level applications and the low-level optimizations he has worked on in previous projects. As LLMs are currently the main driver of the AI industry, improving their efficiency and feasibility in real-world environments is a global priority. Contributing to this area alongside UPC researchers and the Qualcomm team represents an invaluable opportunity for him. Learning from these industry leaders is a challenge he takes on with great enthusiasm.

Research Support Investigator

Sergi Tomàs Martínez

Bachelor's and Master's Degree in Artificial Intelligence

Host Organization

Supervisors

Sergi Abadal

UPC Supervisor

The content of this website reflects only the views of the Catedra Chip Chair UPC project.

Simulation-Based Recommendation Framework for Scalable Training of Distributed AI

Tomàs Gadea Alcaide

Research Support Investigator

Hierarchical floorplanning optimization algorithms for System-on-Chip architectures

Bernat Ibañez

Research Support Investigator

High Predictability Global Routing during Floorplanning of Complex Chips

Antoni Pech Alberich

Research Support Investigator

Mathematical Optimization Techniques for Hierarchical Floorplanning of Complex Chips

Yilihamujiang Yimamu

Predoctoral Researcher

Optimization of Training Distributed AI Workloads on Multi-node Computing Systems

Xavier Querol Bassols

Research Support Investigator

Development of an Interactive Graphical Tool for Optimization and Editing of Floorplanning in Chip Design

Nuria Elizondo Cereza

Research Support Investigator

Optimization of Training Distributed AI Workloads on Multi-node Computing Systems

Mohammad Nasser

Predoctoral Researcher

Development of Mathematical and Heuristic Optimization Tools for Chip Floorplanning

Guillem Pastor Rué

Research Support Investigator
