Optimization of Training Distributed AI Workloads on Multi-node Computing Systems.

Description

This project focuses on improving the training efficiency of Large Language Models (LLMs), which require massive computational resources. The goal is to develop a framework that identifies the parallelism strategy that maximizes performance in distributed training. Using AstraSim and its synthetic LLM workload generator, the research explores degrees of data, tensor, sequence, and pipeline parallelism. In addition, an optimized network model that accounts for congestion effects is integrated, so that the framework can give both AI researchers and hardware engineers accurate recommendations on design trade-offs and bottlenecks.
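The search space the framework explores can be sketched as follows. This is a minimal illustration, not AstraSim's actual API: the function name `parallelism_configs` and the restriction to three parallelism dimensions (data, tensor, pipeline) are assumptions made for brevity; in the real workflow each candidate configuration would be evaluated by simulation rather than merely enumerated.

```python
from itertools import product

def parallelism_configs(num_gpus):
    """Enumerate (data, tensor, pipeline) parallelism degrees whose
    product equals the total GPU count.

    Each valid tuple is one candidate strategy; in a framework like
    the one described, each candidate would then be scored by a
    simulator (e.g. AstraSim) to find the best-performing one.
    """
    divisors = [d for d in range(1, num_gpus + 1) if num_gpus % d == 0]
    return [
        (dp, tp, pp)
        for dp, tp, pp in product(divisors, repeat=3)
        if dp * tp * pp == num_gpus
    ]

# For 8 GPUs there are 10 ordered factorizations into three degrees,
# e.g. (8, 1, 1), (2, 2, 2), (1, 4, 2), ...
configs = parallelism_configs(8)
```

Even this toy enumeration shows why simulation matters: the number of candidate strategies grows quickly with GPU count and parallelism dimensions, so exhaustively training under each one is infeasible.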

Background

Xavier is a Master's student in Artificial Intelligence at the Polytechnic University of Catalonia (UPC) and a graduate in Data Engineering from the Autonomous University of Barcelona (UAB). His main interests lie in machine learning and computer vision. Professionally, he has gained experience as a Data Engineer at Zurich Insurance, a Data Scientist at Bonarea IT, and a developer of AI document-processing solutions at Serimag, giving him a comprehensive view of the data lifecycle, from infrastructure to model deployment.

Motivation

He chose this project to gain research experience before completing his university studies, applying his practical background from an analytical perspective. He is particularly motivated by the current relevance of LLMs and by the challenge of reducing their training time and energy consumption, key factors for the sustainability of modern AI systems.

Research Support Investigator

Xavier Querol Bassols

Degree in Data Engineering and Master's in Artificial Intelligence

Host Organization

Supervisors

Sergi Abadal Cavalle

UPC Supervisor

The content of this website reflects only the views of the Catedra Chip Chair UPC project.

Optimizing distributed AI workload training on multi-node computing systems

Sergi Tomàs Martínez

Research Support Investigator

Simulation-Based Recommendation Framework for Scalable Training of Distributed AI

Tomàs Gadea Alcaide

Research Support Investigator

Hierarchical floorplanning optimization algorithms for System-on-Chip architectures

Bernat Ibañez

Research Support Investigator

High Predictability Global Routing during Floorplanning of Complex Chips

Antoni Pech Alberich

Research Support Investigator

Mathematical Optimization Techniques for Hierarchical Floorplanning of Complex Chips

Yilihamujiang Yimamu

Predoctoral Researcher

Development of an Interactive Graphical Tool for Optimization and Editing of Floorplanning in Chip Design

Nuria Elizondo Cereza

Research Support Investigator

Optimization of Training Distributed AI Workloads on Multi-node Computing Systems

Mohammad Nasser

Predoctoral Researcher

Development of Mathematical and Heuristic Optimization Tools for Chip Floorplanning

Guillem Pastor Rué

Research Support Investigator
