ST-Think: How Multimodal Large Language Models Reason about 4D Worlds from Ego-Centric Videos

Published in IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026

Recommended citation: Wu, P., Liu, Y., Liu, M., Shen, J. (2026). ST-Think: How Multimodal Large Language Models Reason about 4D Worlds from Ego-Centric Videos. In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). https://arxiv.org/abs/2503.12542

Peiran Wu, Yiling Liu, Meiyi Liu, Junxiao Shen

This paper explores how multimodal large language models understand and reason about temporal and spatial relationships in egocentric video.