COSCUP 2025

Let's build a Visual Language Model (VLM) from scratch with Python
09/08/2025, TR410

Ever wondered how Vision Language Models (VLMs) work? A VLM is built from a vision encoder and a language decoder: it accepts both images and text as input and can answer vision-language questions with detailed insights. Building a VLM from scratch lets us customize each component for our application. The goal of this talk is to demonstrate how VLMs can be implemented in a Pythonic way. To do so, we will build the PaliGemma VLM completely from scratch, all in Python.
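As a taste of the architecture described above, here is a minimal PyTorch sketch of the encoder-plus-decoder structure: image patches are embedded, prepended to the text token embeddings as a prefix, and the combined sequence is fed through a transformer stack that predicts next-token logits. All names and dimensions here are illustrative toys, not PaliGemma's actual configuration (which pairs a SigLIP vision encoder with a Gemma decoder and uses causal masking over the text):

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Toy VLM: image tokens prefix the text tokens in one transformer."""

    def __init__(self, patch=8, vocab=100, d_model=64):
        super().__init__()
        # Vision encoder (simplified): split the image into patches
        # and embed each patch with a strided convolution.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        # Language side: token embeddings and a self-attention stack
        # (causal masking omitted for brevity).
        self.tok_embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, image, tokens):
        # (B, 3, H, W) -> (B, num_patches, d_model)
        img = self.patch_embed(image).flatten(2).transpose(1, 2)
        txt = self.tok_embed(tokens)            # (B, T, d_model)
        seq = torch.cat([img, txt], dim=1)      # image tokens as a prefix
        h = self.decoder(seq)
        # Keep only the text positions for next-token prediction.
        return self.lm_head(h[:, img.size(1):])

model = TinyVLM()
image = torch.randn(1, 3, 32, 32)               # one 32x32 RGB image
tokens = torch.randint(0, 100, (1, 5))          # five prompt tokens
logits = model(image, tokens)
print(logits.shape)                             # torch.Size([1, 5, 100])
```

The prefix arrangement, where image embeddings simply occupy the first positions of the decoder's input sequence, is the key design choice that lets a single language model attend to both modalities.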


Target audience:

Targeted at developers with intermediate knowledge of Python and Transformers who would like to get familiar with what Python can achieve for building VLMs, and how. The talk also explores how PyTorch supports implementing the key components of a VLM.

Difficulty level:

Advanced

John is a Senior AI Engineer, currently focused on developing NLP applications.

He is deeply motivated by challenges and excited by breaking conventional ways of thinking and doing. With prior experience in Software Engineering, he combines the latest AI technology with engineering practice to turn challenges into practical solutions.

Other session(s) by this speaker: