St-think: How Multimodal Large Language Models Reason about 4D Worlds from Ego-Centric Videos
IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026 · January 2026
Citation
Wu, P., Liu, Y., Liu, M., & Shen, J. (2026). St-think: How Multimodal Large Language Models Reason about 4D Worlds from Ego-Centric Videos. In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV).
Peiran Wu, Yiling Liu, Meiyi Liu, Junxiao Shen
This paper explores how multimodal large language models understand and reason about temporal and spatial relationships in egocentric video.