Sifei Liu

I am a Principal Research Scientist and Tech Lead at NVIDIA Research, where I work with the LPR team led by Jan Kautz. My current research focuses on embodied foundation models, efficient transformer architectures, and spatial reasoning. I am deeply involved in VLM and VLA foundation model efforts across Cosmos and Isaac GR00T.

Before joining NVIDIA, I earned my Ph.D. at the VLLAB at UC Merced, advised by Ming-Hsuan Yang. I have been fortunate to receive the Baidu Graduate Fellowship, the NVIDIA Pioneering Research Award, and the Rising Star EECS recognition.

Current Focus

Building grounded multimodal systems that can reason, scale, and act in open-world settings.

Embodied foundation models · Efficient transformers · Spatial reasoning

Research Themes

What I work on

I work on making multimodal systems more grounded, more efficient, and more capable in open-world environments.

Embodied VLM

Perception for agents and robotics

Building multimodal systems that can perceive, reason, and act in 3D environments for navigation and embodied decision making.

Efficient Models

Transformer efficiency at high resolution

Designing token-efficient architectures and attention mechanisms that preserve detail without paying the full compute cost.

Spatial Intelligence

Grounded multimodal understanding

Connecting images, language, and geometry so models can reason about structure, localization, and relationships across views.

News

  • Jun 2026
    We released Cosmos 3, NVIDIA’s omnimodal world foundation model powering the next generation of Physical AI. I am fortunate to lead the work on its spatial and embodied capabilities.
  • Jun 2026
    Our CVPR 2026 paper GR3D unifies 2D and monocular 3D grounding for grounded spatial chain-of-thought reasoning.
  • Apr 2026
    We released LoHo-Manip, a trace-conditioned VLA planning framework for long-horizon robotic manipulation.
  • Feb 2026
    We released Compact GSPN, scaling spatial propagation networks to vision foundation models with nearly 10x lower propagation latency.
  • Jan 2026
    We released SR-3D, a 3D-aware region-prompted VLM for grounded spatial reasoning across views and scenes.
  • Oct 2025
    We released OmniVinci, an omni-modal LLM for joint understanding of vision, audio, and text.
  • Oct 2025
    Our ICCV 2025 paper Describe Anything introduces DAM for detailed localized image and video captioning.
  • Oct 2025
    Our ICCV 2025 paper Token-Efficient VLM presents an efficient VLM for high-resolution visual understanding.
  • Mar 2025
    SpatialRGPT was demoed at GTC 2025 as part of Agentic AI for Physical Operations!
  • Feb 2025
    We released GSPN, a fast vision attention module that accelerates Stable Diffusion inference by 84x. Stay tuned for more details!
  • Feb 2025
    5 papers were accepted to CVPR 2025! Stay tuned for more updates!
  • Jan 2025
    We released NaVILA, a navigation agent that follows language instructions to navigate 3D environments.
  • Dec 2024
    We presented CosAE at NeurIPS 2024! Stay tuned for the code release.
  • Oct 2024
    We released the SpatialRGPT code, datasets, and models! Feel free to try the demos!

Selected Publications

NVIDIA Technical Report 2026

Cosmos 3: Omnimodal World Models for Physical AI

Cosmos 3 is NVIDIA's omnimodal world foundation model that unifies understanding, generation, simulation, and action across text, image, video, audio, and robot actions for Physical AI. I serve as a core contributor on its spatial and embodied capabilities.

arXiv 2026

Long-Horizon Manipulation via Trace-Conditioned VLA Planning

Isabella Liu, An-Chieh Cheng, Rui Yan, Geng Chen, Ri-Zhao Qiu, Xueyan Zou, Sha Yi, Hongxu (Danny) Yin, Xiaolong Wang, Sifei Liu

LoHo-Manip is a modular framework that scales short-horizon vision-language-action policies to long-horizon manipulation, using a task-management VLM that predicts subtask sequences and 2D visual traces to guide the executor with implicit progress tracking, replanning, and recovery.

CVPR 2026

Grounded 3D-Aware Spatial Vision-Language Modeling

An-Chieh Cheng, Yang Fu, Yatai Ji, Ligeng Zhu, Guanqi Zhan, Zhuoyang Zhang, Zhaojing Yang, Song Han, Yao Lu, Pavlo Molchanov, Vidya Nariyambut Murali, Jan Kautz, Xiaolong Wang, Hongxu (Danny) Yin, Sifei Liu

GR3D is a spatial vision-language model that unifies explicit 2D, implicit 2D, and monocular 3D grounding in a single framework, decomposing spatial understanding into grounded 2D perception followed by 3D inference.

arXiv 2025

OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM

Hanrong Ye, Chao-Han Huck Yang, Arushi Goel, Wei Huang, Ligeng Zhu, Yuanhang Su, Sean Lin, An-Chieh Cheng, Zhen Wan, Jinchuan Tian, Yuming Lou, Dong Yang, Zhijian Liu, Yukang Chen, Ambrish Dantrey, Ehsan Jahangiri, Sreyan Ghosh, Daguang Xu, Ehsan Hosseini-Asl, Danial Mohseni Taheri, Vidya Nariyambut Murali, Sifei Liu, Yao Lu, Oluwatobi Olabiyi, Yu-Chiang Frank Wang, Rafael Valle, Bryan Catanzaro, Andrew Tao, Song Han, Jan Kautz, Hongxu (Danny) Yin, Pavlo Molchanov

OmniVinci is an omni-modal LLM with a new architecture (OmniAlignNet, Temporal Embedding Grouping, and Constrained Rotary Time Embedding) and a 24M-conversation data pipeline, achieving state-of-the-art joint understanding of images, video, audio, and text with far fewer training tokens.

arXiv 2026

Compact GSPN: Scaling Spatial Propagation to Vision Foundation Models

Yitong Jiang, Collin McCarthy, Hongjun Wang, Hanrong Ye, Qi Dou, Tianfan Xue, Jingwei Gu, Jan Kautz, Hongxu (Danny) Yin, Pavlo Molchanov, Sifei Liu

Compact GSPN (C-GSPN) is a ViT block with compressed spatial propagation and fused CUDA kernels that cuts propagation latency by nearly 10x, using a two-stage distillation scheme to scale subquadratic spatial propagation networks to vision foundation models.

ICLR 2026

3D Aware Region Prompted Vision Language Model

An-Chieh Cheng, Yang Fu, Yukang Chen, Zhijian Liu, Xiaolong Li, Subhashree Radhakrishnan, Song Han, Yao Lu, Jan Kautz, Pavlo Molchanov, Hongxu (Danny) Yin, Xiaolong Wang, Sifei Liu

SR-3D unifies single-view 2D and multi-view 3D representations for flexible region prompting and grounded spatial reasoning.

ICCV 2025

Describe Anything: Detailed Localized Image and Video Captioning

Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Darrell, Adam Yala, Yin Cui

DAM generates detailed localized captions for user-specified regions in images and videos, preserving both local detail and global context.

ICCV 2025

Token-Efficient VLM: High-Resolution Image Understanding via Dynamic Region Proposal

Yitong Jiang, Jingwei Gu, Tianfan Xue, Ka Chun Cheung, Pavlo Molchanov, Hongxu (Danny) Yin, Sifei Liu

TEVA improves high-resolution image understanding by dynamically selecting detail-rich regions while keeping token usage efficient.

CVPR 2025

Parallel Sequence Modeling via Generalized Spatial Propagation Network

GSPN is a fast vision attention module that accelerates generic vision foundation models for high-resolution input images.

arXiv 2025

NaVILA: Legged Robot Vision-Language-Action Model for Navigation

NaVILA is a two-level framework that combines VLAs with locomotion skills for navigation. It generates high-level language-based commands, while a real-time locomotion policy ensures obstacle avoidance.

ICLR 2025

No Pose, No Problem: Surprisingly Simple 3D Gaussian Splats from Sparse Unposed Images

NeurIPS 2024

SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models

SpatialRGPT is a grounded vision-language model that reasons about spatial relationships between regions in images.

CVPR 2024

COLMAP-Free 3D Gaussian Splatting

Yang Fu, Sifei Liu, Amey Kulkarni, Jan Kautz, Alexei A. Efros, Xiaolong Wang

3D Gaussian Splatting without COLMAP computation.

CVPR 2023

Open-vocabulary panoptic segmentation with text-to-image diffusion models

We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation, which unifies pre-trained text-image diffusion and discriminative models to perform open-vocabulary panoptic segmentation.