St-think: How Multimodal Large Language Models Reason about 4D Worlds from Ego-Centric Videos
Published in IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026
Recommended citation: Wu, P., Liu, Y., Liu, M., Shen, J. (2026). St-think: How Multimodal Large Language Models Reason about 4D Worlds from Ego-Centric Videos. In IEEE/CVF Winter Conference on Applications of Computer Vision. https://arxiv.org/abs/2503.12542
Peiran Wu, Yiling Liu, Meiyi Liu, Junxiao Shen
This paper examines how multimodal large language models understand and reason about spatial and temporal relationships in egocentric video.
