Simulation-Based Recommendation Framework for Scalable Training of Distributed AI

Description

This project focuses on the development of a simulation-based recommendation framework to optimize the training of Large Language Models (LLMs) on distributed systems. The research addresses the challenge of balancing compute, memory, and communication resources using state-of-the-art tools such as AstraSim and STG. The framework explores a vast design space to identify the best parallelism strategies (data, tensor, sequence, and pipeline) and to evaluate the impact of different network topologies, such as Folded-Clos. The ultimate goal is to provide data-driven recommendations that maximize throughput while minimizing power and memory consumption.
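To make the exploration concrete, the Python sketch below enumerates every (data, tensor, sequence, pipeline) degree assignment for a fixed device count and ranks each one on each candidate topology with a toy analytical cost model. Everything here is a hypothetical illustration: the simulate function and its formulas merely stand in for a real simulator run (such as an AstraSim invocation), and the objective weights are arbitrary placeholders, not the project's actual recommendation criteria.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Config:
    dp: int        # data-parallel degree
    tp: int        # tensor-parallel degree
    sp: int        # sequence-parallel degree
    pp: int        # pipeline-parallel degree
    topology: str  # interconnect topology name


def simulate(cfg: Config) -> dict:
    """Toy analytical stand-in for one simulator run (e.g., an AstraSim
    invocation). The formulas are illustrative placeholders only."""
    devices = cfg.dp * cfg.tp * cfg.sp * cfg.pp
    # Assume communication overhead grows with the fine-grained
    # parallelism degrees (tensor/sequence) and, more mildly, pipeline.
    comm_penalty = 1.0 + 0.05 * (cfg.tp + cfg.sp) + 0.02 * cfg.pp
    # Assume a Folded-Clos fabric relieves some of that overhead.
    topo_factor = 1.1 if cfg.topology == "folded-clos" else 1.0
    throughput = devices * topo_factor / comm_penalty   # samples/s
    power = 300.0 * devices * comm_penalty              # watts
    memory = 80.0 / (cfg.tp * cfg.pp)                   # GB per device
    return {"throughput": throughput, "power": power, "memory": memory}


def recommend(num_devices: int, topologies: list[str]) -> Config:
    """Score every (dp, tp, sp, pp) split whose degrees multiply to the
    device count, on every topology, and return the best configuration."""
    divisors = [d for d in range(1, num_devices + 1) if num_devices % d == 0]
    best, best_score = None, float("-inf")
    for dp, tp, sp, pp in product(divisors, repeat=4):
        if dp * tp * sp * pp != num_devices:
            continue
        for topo in topologies:
            m = simulate(Config(dp, tp, sp, pp, topo))
            # Placeholder objective: reward throughput, penalize power
            # and per-device memory (weights chosen arbitrarily).
            score = m["throughput"] - 1e-4 * m["power"] - 0.1 * m["memory"]
            if score > best_score:
                best, best_score = Config(dp, tp, sp, pp, topo), score
    return best


if __name__ == "__main__":
    print(recommend(16, ["folded-clos", "ring"]))
```

In a real framework, the analytical stand-in would be replaced by per-configuration simulator runs, which is exactly where exploration time dominates and why a recommendation layer that prunes the design space is valuable.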

Motivation

His research is driven by the need to overcome the hardware limitations that hinder progress toward more powerful and scalable LLMs. He is particularly motivated by building tools that bridge the design of AI models with the design of the underlying hardware, drastically reducing design-space exploration time. His work aims to enable systems engineers to make efficient decisions that render distributed training more sustainable and accessible.

Tomàs Gadea Alcaide

Bachelor's Degree in Data Science and Engineering

Host Organization

Supervisors

Jordi Cortadella

UPC Supervisor

