COSCUP 2025

Shivay Lamba

Shivay Lamba is a software developer specializing in DevOps, Machine Learning and Full Stack Development.

He is an open source enthusiast who has mentored in programs such as Google Code-in and Google Summer of Code, and he has also been an MLH Fellow.
He is actively involved in community work as well. He is a TensorFlow.js SIG member and a mentor in OpenMined, the CNCF Service Mesh Community, and the SODA Foundation, and he has given talks at various conferences, including GitHub Satellite, Voice Global, FOSSASIA Tech Summit, and TensorFlow.js Show & Tell.


Session

08/10
10:40
30min
Multi Cluster GPU Allocation for AI Research
Hrittik Roy, Shivay Lamba

As LLMs and generative models grow more complex, they can no longer be trained on CPUs or on a single GPU cluster. Training requires multiple GPUs, and managing them can be complicated. GPU partitioning in the cloud is often perceived as a complicated, resource-intensive process that only narrowly focused teams or large enterprises can take on. This talk explores why GPU partitioning is necessary for running Python AI workloads and how it can be done efficiently using open source tooling.

The talk will address some common myths, such as the belief that GPU partitioning on systems like Kubernetes requires advanced hardware configurations or prohibitive costs.

In this talk, we will illustrate how modern technologies like NVIDIA MIG, combined with vCluster, enable seamless sharing of GPUs across different teams, leading to more efficient resource utilization, higher throughput, and broader accessibility for workloads like LLM fine-tuning and inference. The talk aims to inspire developers and engineers to learn the key techniques for efficient GPU scheduling and resource sharing across multiple GPU clusters with open source platform tooling like vCluster.

Open Source AI and Machine Learning
AU