
Doctoral Defense - Lan Wang


About the Event

The Department of Computer Science & Engineering
Michigan State University
Ph.D. Dissertation Defense
May 27th, 2025 at 11:00 am ET
Zoom Meeting: Contact the Department or Advisor for Zoom Information

ABSTRACT

Learning, Seeing, and Highlighting with Multimodal Models

By: Lan Wang
Advisor: Dr. Vishnu Boddeti

Multimodal models integrate heterogeneous data sources such as text, images, audio, and video to support tasks that rely on information from multiple modalities. By jointly analyzing these modalities, they achieve more comprehensive and robust performance than unimodal approaches. Despite recent progress, several fundamental challenges in multimodal models remain unresolved. First, it is unclear how multimodal models understand and learn across different downstream tasks, and whether their learned representations are fair or can be made less biased. Second, what does the ‘world’ imagined by multimodal models look like through generation, and can this synthesized information enhance their multimodal understanding? Finally, with the explosive growth of visual content, can multimodal models efficiently highlight useful information in massive data streams?

This dissertation investigates how multimodal models learn across different downstream tasks and achieve unbiased representations, see generated content before understanding it, and highlight important content in extensive vision data. For learning, we enhance video temporal grounding by introducing an untrimmed pretraining method with a similarity-based grounding module, and we promote fairer representations in vision-language models such as CLIP by jointly debiasing image and text features in a reproducing kernel Hilbert space (RKHS). To enable models to see imagined content, we develop a video editing framework that answers “what-if” questions through generation. We further investigate whether multimodal models can understand better by seeing generated content, proposing a composed image retrieval method that leverages pseudo-targets imagined from the input text and images. Finally, to highlight key information in long videos, we propose a semantic learning framework that identifies context-rich tokens, improving efficiency and understanding in applications such as temporal grounding and video question answering.

Tags

Doctoral Defenses

Date

Tuesday, May 27, 2025

Time

11:00 AM

Location

Zoom

Organizer

Lan Wang