Simulation-Based Recommendation Framework for Scalable Training of Distributed AI

Description

This project focuses on the development of a simulation-based recommendation framework to optimize the training of Large Language Models (LLMs) on distributed systems. The research addresses the challenge of balancing compute, memory, and communication resources using state-of-the-art tools such as AstraSim and STG. The framework explores a vast design space to identify the best parallelism strategies (data, tensor, sequence, and pipeline) and to evaluate the impact of different network topologies, such as Folded-Clos. The ultimate goal is to provide data-driven recommendations that maximize throughput while minimizing power and memory consumption.
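To make the exploration concrete, the Python sketch below enumerates every (data, tensor, sequence, pipeline) degree assignment for a fixed device count and ranks each one on each candidate topology with a toy analytical cost model. Everything here is a hypothetical illustration: the simulate function and its formulas merely stand in for a real simulator run (such as an AstraSim invocation), and the objective weights are arbitrary placeholders, not the project's actual recommendation criteria.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Config:
    dp: int        # data-parallel degree
    tp: int        # tensor-parallel degree
    sp: int        # sequence-parallel degree
    pp: int        # pipeline-parallel degree
    topology: str  # interconnect topology name


def simulate(cfg: Config) -> dict:
    """Toy analytical stand-in for one simulator run (e.g., an AstraSim
    invocation). The formulas are illustrative placeholders only."""
    devices = cfg.dp * cfg.tp * cfg.sp * cfg.pp
    # Assume communication overhead grows with the fine-grained
    # parallelism degrees (tensor/sequence) and, more mildly, pipeline.
    comm_penalty = 1.0 + 0.05 * (cfg.tp + cfg.sp) + 0.02 * cfg.pp
    # Assume a Folded-Clos fabric relieves some of that overhead.
    topo_factor = 1.1 if cfg.topology == "folded-clos" else 1.0
    throughput = devices * topo_factor / comm_penalty   # samples/s
    power = 300.0 * devices * comm_penalty              # watts
    memory = 80.0 / (cfg.tp * cfg.pp)                   # GB per device
    return {"throughput": throughput, "power": power, "memory": memory}


def recommend(num_devices: int, topologies: list[str]) -> Config:
    """Score every (dp, tp, sp, pp) split whose degrees multiply to the
    device count, on every topology, and return the best configuration."""
    divisors = [d for d in range(1, num_devices + 1) if num_devices % d == 0]
    best, best_score = None, float("-inf")
    for dp, tp, sp, pp in product(divisors, repeat=4):
        if dp * tp * sp * pp != num_devices:
            continue
        for topo in topologies:
            m = simulate(Config(dp, tp, sp, pp, topo))
            # Placeholder objective: reward throughput, penalize power
            # and per-device memory (weights chosen arbitrarily).
            score = m["throughput"] - 1e-4 * m["power"] - 0.1 * m["memory"]
            if score > best_score:
                best, best_score = Config(dp, tp, sp, pp, topo), score
    return best


if __name__ == "__main__":
    print(recommend(16, ["folded-clos", "ring"]))
```

In a real framework, the analytical stand-in would be replaced by per-configuration simulator runs, which is exactly where exploration time dominates and why a recommendation layer that prunes the design space is valuable.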

Motivation

His research is driven by the need to overcome the hardware limitations that hinder progress toward more powerful and scalable LLMs. He is particularly motivated by building tools that bridge the design of AI models with the design of the underlying hardware, drastically reducing design-space exploration time. His work aims to enable systems engineers to make efficient decisions that render distributed training more sustainable and accessible.

Tomàs Gadea Alcaide

Bachelor's Degree in Data Science and Engineering

Host Organization

Supervisors

Jordi Cortadella

UPC Supervisor

