Publications
See also my Google Scholar profile.
2026
- arXiv preprint arXiv:2601.14895. Unified 3D memory architecture with metric anchoring and fast retrieval for spatial AI systems — language grounding and question answering over long-horizon video.
- IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026. A novel approach to video quality assessment using caption-embedded multimodal perception for compressed video analysis without reference frames.
- IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026. Exploring how multimodal large language models can understand and reason about temporal and spatial relationships in egocentric video content.
- International Conference on Learning Representations (ICLR) 2026. MARC: memory-augmented reinforcement-learning token compression for efficient video understanding, enabling better processing of long-form video content.
2025
- Conference on Empirical Methods in Natural Language Processing (EMNLP) 2025, Findings. A comprehensive benchmark for evaluating AI models on extremely long egocentric video sequences, addressing challenges in temporal video understanding.
- IEEE/CVF International Conference on Computer Vision (ICCV) 2025. A large-scale dataset of cultural landmarks and terrains designed for advancing Gaussian-based scene rendering techniques in computer vision.
- arXiv preprint arXiv:2509.16811. A prompt-driven agentic system that autonomously comprehends and edits long-form, story-driven video — bridging LLM planning and video understanding.
- arXiv preprint arXiv:2507.11336. An omni captioning model and new benchmarks for detailed description of user-generated video content across diverse domains.
- arXiv preprint arXiv:2503.12332. A scalable Mamba-based autoregressive pretraining recipe for long-form video understanding.
- arXiv preprint arXiv:2502.15228. A universal time-series motion-recognition pipeline that automates model selection and training across heterogeneous sensor streams.
- arXiv preprint arXiv:2502.12297. A low-latency streaming gesture-recognition framework for on-device real-time input in XR and wearable settings.
2024
- arXiv preprint arXiv:2411.12778. A temporal computing platform that maintains persistent visual memory across long-horizon interactions — the technical vision underpinning the Memories.ai product.
- IEEE Transactions on Visualization and Computer Graphics (TVCG), 30(11):7441–7451 (ISMAR 2024). A ring-based mid-air gesture typing system for AR glasses, achieving 25 WPM at 96% accuracy using a smart-ring input device and a deep-learning word-prediction framework. Meta patent filed.
- IEEE Transactions on Visualization and Computer Graphics (TVCG), 30(11):7118–7128 (ISMAR 2024). The first pre-trained foundation model for word-gesture decoding in XR. Coarse trajectory discretization plus pre-training generalizes across keyboard layouts, improving accuracy by ~40% over prior methods.
- arXiv preprint arXiv:2411.00489. A comprehensive survey establishing a human-inspired taxonomy for AI long-term memory; a widely cited foundational reference for emerging work on persistent AI systems.
- IEEE International Symposium on Mixed and Augmented Reality (ISMAR) 2024. The first open-world continual-learning framework for gesture recognition — handles unseen gesture classes and distribution shifts in deployed XR systems.
- IEEE International Symposium on Mixed and Augmented Reality (ISMAR) 2024. A foundational encode–store–retrieve architecture for AI memory augmentation via language-encoded egocentric perception. The core technology behind the Memories.ai platform.
- IEEE Transactions on Visualization and Computer Graphics (TVCG), 30(9):6493–6506. A comprehensive evaluation of controller-based raycasting methods for text entry in virtual reality environments, improving alphanumeric and special-character input efficiency.
- IEEE International Conference on Automatic Face and Gesture Recognition (FG) 2024. An automatic gesture-annotation framework that reduces manual labelling effort by roughly 90% while measurably improving downstream gesture-recognition accuracy. FG is the leading venue for face and gesture recognition.
2023
- IEEE Transactions on Visualization and Computer Graphics (TVCG), 29(11):4622–4632. Fast and robust mid-air gesture typing for AR headsets using 3D trajectory decoding. The core framework has been adopted by 15+ research groups internationally. Demo video.
- arXiv preprint arXiv:2310.08101. A conversational, autonomous agent that generates prompts on the fly for LLM-powered intelligent text-entry techniques.
- ACM Conference on Human Factors in Computing Systems (CHI) 2023. The first framework for explainable AI in augmented reality — surfacing AI decision-making through augmented visual overlays. Published at CHI, the premier HCI venue (acceptance rate ~25%).
2022
- IEEE International Symposium on Mixed and Augmented Reality (ISMAR) 2022. Personalization of a mid-air gesture keyboard using multi-objective Bayesian optimization to trade off typing speed and accuracy for each user.
- IEEE Transactions on Visualization and Computer Graphics (TVCG), 28(11):3618–3628 (presented at ISMAR 2022). An open-source toolkit enabling rapid prototyping and evaluation of key-gesture spotting in VR/AR. Demo video.
- 27th International Conference on Intelligent User Interfaces (IUI) 2022. The first context-aware multi-turn dialogue system for augmentative and alternative communication. KWickChat expands a small bag of keywords into fluent, context-appropriate sentences — improving text-entry speed for people who rely on AAC.
- International Conference on Learning Representations (ICLR) 2022. A context-dependent reinforcement-learning method using a Hierarchical Dirichlet Process prior to handle discrete Markovian context evolution. Published at ICLR, a top-tier machine-learning venue.
2021
- 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG) 2021. An imaginative GAN that automatically augments skeleton-based motion data, improving downstream hand-gesture and human-action recognition accuracy.
- IEEE International Symposium on Mixed and Augmented Reality (ISMAR) 2021. Simulation of realistic human motion trajectories for mid-air gesture typing, enabling data-driven AR text-entry research without costly user studies.