St-think: How Multimodal Large Language Models Reason about 4D Worlds from Ego-Centric Videos

IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026 · January 2026

Citation

Wu, P., Liu, Y., Liu, M., Shen, J. (2026). St-think: How Multimodal Large Language Models Reason about 4D Worlds from Ego-Centric Videos. In IEEE/CVF Winter Conference on Applications of Computer Vision.

Peiran Wu, Yiling Liu, Meiyi Liu, Junxiao Shen

This paper explores how multimodal large language models understand and reason about temporal and spatial relationships in egocentric video.
