Multimodal Machine Learning (多模态机器学习)

Undergraduate course, Renmin University of China, 2025

This is a comprehensive course on multimodal machine learning taught by Prof. Wenbing Huang of GSAI, Renmin University of China. The course focuses mainly on recent advances in computer vision, vision-language models (VLMs), and multimodal large language models (MLLMs).

The slides and assignments for this course are provided below. For the final project, our team implemented text-to-image and image-to-text models using the multimodal machine learning techniques taught in this course.

| Slides | Assignment | Technical Report | Code |
| --- | --- | --- | --- |
| Lec 1: Introduction | | | |
| Lec 2: Basic Concepts | | | |
| Lec 3: Vision Modality and CNNs | Assignment 1: CNN | | |
| Lec 4: Language Modality and RNNs | Assignment 2: RNN implementation | | |
| Lec 5: Multimodal Representation | Assignment 3: Fusion and Coordination | | |
| Lec 6: Multimodal Alignment | Assignment 4: Dynamic Time Warping | | |
| Lec 7: Multimodal Generation (Part 1) | | | |
| Lec 8: Multimodal Generation (Part 2) | | | |
| Lec 9: Multimodal Generation (Part 3) | | | |
| Lec 10: Multimodal Self-supervised Learning (Part 1) | | | |
| Lec 11: Multimodal Self-supervised Learning (Part 2) | | | |
| Lec 12: Multimodal Large Language Model | Final Project: Text-to-Image and Image-to-Text | | |