Multimodal Machine Learning (多模态机器学习)

Undergraduate course, Renmin University of China, 2025

This is a comprehensive course on multimodal machine learning taught by Prof. Wenbing Huang of GSAI, Renmin University of China. The course focuses mainly on recent advances in computer vision, vision-language models (VLMs), and multimodal large language models (MLLMs).

The slides and assignments for this course are provided below. For the final project, our team implemented text-to-image and image-to-text models using the multimodal machine learning techniques taught in the course.

Slides	Assignment Technical Report	Code
Lec 1: Introduction
Lec 2: Basic Concepts
Lec 3: Vision Modality and CNNs	Assignment 1: CNN
Lec 4: Language Modality and RNNs		Assignment 2: RNN implementation
Lec 5: Multimodal Representation		Assignment 3: Fusion and Coordination
Lec 6: Multimodal Alignment		Assignment 4: Dynamic Time Warping
Lec 7: Multimodal Generation (Part 1)
Lec 8: Multimodal Generation (Part 2)
Lec 9: Multimodal Generation (Part 3)
Lec 10: Multimodal Self-supervised Learning (Part 1)
Lec 11: Multimodal Self-supervised Learning (Part 2)
Lec 12: Multimodal Large Language Model	Final Project: Text-to-Image and Image-to-Text


Shukai Gong
