Optimization of Training Distributed AI Workloads on Multi-node Computing Systems.

Description

This project focuses on improving the training efficiency of Large Language Models (LLMs), which require massive computational resources. The goal is to develop a framework that identifies the parallelism strategy that maximizes performance in distributed training. Using AstraSim and its synthetic LLM workload generator, the research explores degrees of data, tensor, sequence, and pipeline parallelism. In addition, an optimized network model that accounts for congestion effects is integrated, so that the framework can give both AI researchers and hardware engineers accurate recommendations on design trade-offs and bottlenecks.
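The search space the framework explores can be sketched as follows. This is a minimal illustration, not AstraSim's actual API: the function name `parallelism_configs` and the restriction to three parallelism dimensions (data, tensor, pipeline) are assumptions made for brevity; in the real workflow each candidate configuration would be evaluated by simulation rather than merely enumerated.

```python
from itertools import product

def parallelism_configs(num_gpus):
    """Enumerate (data, tensor, pipeline) parallelism degrees whose
    product equals the total GPU count.

    Each valid tuple is one candidate strategy; in a framework like
    the one described, each candidate would then be scored by a
    simulator (e.g. AstraSim) to find the best-performing one.
    """
    divisors = [d for d in range(1, num_gpus + 1) if num_gpus % d == 0]
    return [
        (dp, tp, pp)
        for dp, tp, pp in product(divisors, repeat=3)
        if dp * tp * pp == num_gpus
    ]

# For 8 GPUs there are 10 ordered factorizations into three degrees,
# e.g. (8, 1, 1), (2, 2, 2), (1, 4, 2), ...
configs = parallelism_configs(8)
```

Even this toy enumeration shows why simulation matters: the number of candidate strategies grows quickly with GPU count and parallelism dimensions, so exhaustively training under each one is infeasible.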

Background

Xavier is a Master's student in Artificial Intelligence at the Polytechnic University of Catalonia (UPC) and a graduate in Data Engineering from the Autonomous University of Barcelona (UAB). His main interests lie in machine learning and computer vision. Professionally, he has gained experience as a Data Engineer at Zurich Insurance, a Data Scientist at Bonarea IT, and a developer of AI document-processing solutions at Serimag, giving him a comprehensive view of the data lifecycle, from infrastructure to model deployment.

Motivation

He chose this project to gain research experience before completing his university studies, applying his practical background from an analytical perspective. He is particularly motivated by the current relevance of LLMs and by the challenge of reducing their training time and energy consumption, key factors for the sustainability of modern AI systems.

Research Support Investigator

Xavier Querol Bassols

Degree in Data Engineering and Master's in Artificial Intelligence

Host Organization

Supervisors

Sergi Abadal Cavalle

UPC Supervisor

The content of this website reflects only the views of the Catedra Chip Chair UPC project.

Optimizing distributed AI workload training on multi-node computing systems

Sergi Tomàs Martínez

Research Support Investigator

Simulation-Based Recommendation Framework for Scalable Training of Distributed AI

Tomàs Gadea Alcaide

Research Support Investigator

Hierarchical floorplanning optimization algorithms for System-on-Chip architectures

Bernat Ibañez

Research Support Investigator

High Predictability Global Routing during Floorplanning of Complex Chips

Antoni Pech Alberich

Research Support Investigator

Mathematical Optimization Techniques for Hierarchical Floorplanning of Complex Chips

Yilihamujiang Yimamu

Predoctoral Researcher

Development of an Interactive Graphical Tool for Optimization and Editing of Floorplanning in Chip Design

Nuria Elizondo Cereza

Research Support Investigator

Optimization of Training Distributed AI Workloads on Multi-node Computing Systems

Mohammad Nasser

Predoctoral Researcher

Development of Mathematical and Heuristic Optimization Tools for Chip Floorplanning

Guillem Pastor Rué

Research Support Investigator
