Optimizing distributed AI workload training on multinode computing systems

Description

Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide variety of tasks, making them a central focus of current research in artificial intelligence. However, training these models efficiently requires substantial computational resources and considerable execution time. One of the main challenges is determining the parallelism strategy that maximizes performance. This project aims to develop a framework capable of identifying the most appropriate parallelism configuration for a given LLM. Using AstraSim and its synthetic LLM generator, the framework systematically explores the search space to recommend degrees of data, tensor, sequence, and pipeline parallelism. To further increase simulation accuracy, the framework integrates an optimized network model that accounts for congestion and other limitations inherent to communication in distributed environments. The results of this project are valuable to different professional profiles. Artificial intelligence researchers can use them to obtain recommendations on the most efficient parallelism strategies for training LLMs, while hardware engineers can gain insight into design trade-offs and identify potential bottlenecks associated with different architectural choices.
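
The search over parallelism degrees described above can be illustrated with a minimal sketch. It is not the project's framework and does not call AstraSim: the names ParallelismConfig, enumerate_configs, recommend, and toy_step_time are hypothetical, the cost function is a made-up placeholder for a simulator call, and the sketch assumes that the four degrees multiply to the total number of devices.

```python
from itertools import product
from typing import Callable, Iterator, NamedTuple


class ParallelismConfig(NamedTuple):
    """Degrees of each parallelism dimension (hypothetical container)."""
    data: int
    tensor: int
    sequence: int
    pipeline: int


def enumerate_configs(num_devices: int) -> Iterator[ParallelismConfig]:
    """Yield every ordered 4-way factorization of num_devices.

    Assumption: data * tensor * sequence * pipeline == num_devices.
    """
    divisors = [d for d in range(1, num_devices + 1) if num_devices % d == 0]
    for dp, tp, sp in product(divisors, repeat=3):
        used = dp * tp * sp
        if num_devices % used == 0:
            # The pipeline degree is fixed by the remaining devices.
            yield ParallelismConfig(dp, tp, sp, num_devices // used)


def recommend(
    num_devices: int,
    estimate_step_time: Callable[[ParallelismConfig], float],
) -> ParallelismConfig:
    """Return the configuration with the lowest estimated step time."""
    return min(enumerate_configs(num_devices), key=estimate_step_time)


def toy_step_time(cfg: ParallelismConfig) -> float:
    """Placeholder cost with arbitrary weights, NOT a real simulator.

    In a real setup this would be replaced by a simulation run.
    """
    return (
        0.005 * (cfg.data - 1)
        + 0.010 * (cfg.tensor - 1)
        + 0.020 * (cfg.sequence - 1)
        + 0.030 * (cfg.pipeline - 1)
    )


if __name__ == "__main__":
    print(recommend(8, toy_step_time))
```

Exhaustive enumeration is only practical for small device counts; for larger systems the same interface could be driven by a pruned or heuristic search, with the placeholder cost swapped for simulated step times.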

Background

Sergi holds a B.Sc. in Artificial Intelligence from the Polytechnic University of Catalonia (UPC). During his undergraduate studies, he completed a six-month international exchange program at KU Leuven (Belgium), where he joined the Advanced Master in Artificial Intelligence, with a focus on Engineering and Computer Science. His professional career combines industrial application with cutting-edge research. At Aquiles Solutions, he applied machine learning methodologies to solve complex industrial problems, focusing on decision-making and data-driven optimization. At the Barcelona Supercomputing Center, he worked as a researcher in the field of artificial intelligence security and governance, dedicating himself to the development and analysis of evaluation metrics. His research focused on improving the transparency, security and governance frameworks of large-scale AI models and datasets. Finally, at N3Cat (UPC), he conducted research for his bachelor's thesis on the optimization of quantum computing compilers. In this project, he developed new approaches for qubit allocation using Graph Neural Networks (GNNs) and Reinforcement Learning (RL) with the aim of improving hardware efficiency.

Motivation

Sergi constantly challenges himself to step out of his comfort zone to grow both personally and professionally. This project offers him the opportunity to “open the hood” of AI systems, allowing him to bridge the gap between high-level applications and the low-level optimizations he has worked on in previous projects. As LLMs are currently the main driver of the AI industry, improving their efficiency and feasibility in real-world environments is a global priority. Contributing to this area alongside UPC researchers and the Qualcomm team is an invaluable opportunity for him, and learning from these industry leaders is a challenge he takes on with great enthusiasm.

Sergi Tomàs Martínez

Bachelor's and Master's Degree in Artificial Intelligence

Host Organization

Supervisors

Sergi Abadal

UPC Supervisor

The content of this website reflects only the views of the Catedra Chip Chair UPC project.

Related projects

Simulation-Based Recommendation Framework for Scalable Training of Distributed AI

Hierarchical floorplanning optimization algorithms for System-on-Chip architectures

High Predictability Global Routing during Floorplanning of Complex Chips

Mathematical Optimization Techniques for Hierarchical Floorplanning of Complex Chips

Optimization of Training Distributed AI Workloads on Multi-node Computing Systems.

Development of an Interactive Graphical Tool for Optimization and Editing of Floorplanning in Chip Design

Optimization of Training Distributed AI Workloads on Multi-node Computing Systems

Development of Mathematical and Heuristic Optimization Tools for Chip Floorplanning