Education
University of California, Merced
Ph.D. in EECS, advised by Ming-Hsuan Yang.
Fall 2012 - Fall 2017
University of Science and Technology of China
M.S. in Electronic Engineering and Information Sciences, advised by Stan Z. Li.
Fall 2008 - Spring 2011
Employment
NVIDIA
Researcher
Nov 2017 - Current
Santa Clara, CA
NVIDIA
Research Intern with NVIDIA Research.
March 2017 - Aug 2017
Santa Clara, CA
Chinese University of Hong Kong
Visiting Scholar at the MMLAB.
Jul 2016 - Dec 2016
Hong Kong
Baidu Inc.
Applied Scientist Intern at IDL. Worked on face parsing and beautification apps.
May 2013 - Jan 2016
Beijing, China
Workshop and tutorial organization
New Frontiers for Learning with Limited Labels or Data
ECCV 2020. (Co-organizer and Speaker)
Learning Representations via Graph-structured Networks Tutorial
CVPR 2019 and 2020. (Co-organizer and Speaker)
(Co)-Mentees at NVIDIA Research
- Xueting Li, 2018-2020, Senior Research Scientist at NVIDIA
- Donghong Lee, 2018, Apple Inc.
- Wei-Chih Hung, 2019, Staff Research Scientist at Waymo
- Hung-Yu Tseng, 2019, Research Scientist at Meta
- Wuyang Chen, 2020, Assistant Professor at Simon Fraser University
- Wenling (Wendy) Shang, 2019, Research Engineer
- Siva Karthik Mustikovela, 2019 and 2021, Senior Research Scientist at Cruise
- Xitong Yang, 2019, Research Scientist at Meta
- Yang Fu, 2019 and 2021, Research Scientist at NVIDIA Research
- Jiteng Mu, 2022, Researcher at Adobe
- Jiashun Wang, 2022, PhD student at Carnegie Mellon University
- Qiushan Guo, 2021-2022, Researcher at ByteDance
- Jiarui Xu, 2021-2022, Researcher at OpenAI
- Yufei (Judy) Ye, 2022, Postdoctoral Scholar at Stanford
Recent Publications
3D Aware Region Prompted Vision Language Model
SR-3D connects 2D and 3D representation spaces for flexible region prompting and grounded spatial reasoning.
ICLR, 2026
Describe Anything: Detailed Localized Image and Video Captioning
DAM generates detailed localized image and video captions for user-specified regions while preserving local detail and global context.
ICCV, 2025
Token-Efficient VLM: High-Resolution Image Understanding via Dynamic Region Proposal
TEVA uses dynamic region proposal and adaptive patch sampling to handle high-resolution image understanding efficiently.
ICCV, 2025
Parallel Sequence Modeling via Generalized Spatial Propagation Network
GSPN is a spatially coherent attention mechanism for vision models that improves high-resolution efficiency and fidelity.
CVPR, 2025
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
Omni-RGPT unifies region-level understanding for images and videos with token marks and region-aware video instruction tuning.
CVPR, 2025
BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations
BlobGEN-Vid introduces blob video representations for controllable compositional text-to-video generation.
CVPR, 2025
Scaling Vision Pre-Training to 4K Resolution
PS3 scales CLIP-style vision pre-training to 4K resolution with near-constant cost for stronger high-resolution perception.
CVPR, 2025
NaVILA: Legged Robot Vision-Language-Action Model for Navigation
NaVILA combines vision-language-action modeling with locomotion policies for embodied navigation.
arXiv, 2025
NVILA: Efficient Frontier Visual Language Models
NVILA is an efficient frontier family of visual language models with strong training and inference efficiency.
CVPR, 2025
No Pose, No Problem: Surprisingly Simple 3D Gaussian Splats from Sparse Unposed Images
NoPoSplat reconstructs 3D Gaussian splats from sparse unposed images with a simple pose-free pipeline.
ICLR, 2025
COLMAP-Free 3D Gaussian Splatting
COLMAP-Free 3D Gaussian Splatting enables view synthesis without SfM preprocessing by progressively growing 3D Gaussians from video.
CVPR, 2024
CosAE: Learnable Fourier Series for Image Restoration
CosAE is a cosine autoencoder for image restoration that uses Fourier coefficients for compact yet detail-preserving representations.
NeurIPS, 2024
SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models
SpatialRGPT improves grounded spatial reasoning in VLMs through regional representation learning and depth-aware modeling.
NeurIPS, 2024
RegionGPT: Towards Region Understanding Vision Language Model
RegionGPT enhances region-level captioning and understanding by improving spatial awareness in vision-language models.
CVPR, 2024
Dream-in-4D: A Unified Approach for Text- and Image-guided 4D Scene Generation
Dream-in-4D provides a unified framework for text-guided, image-guided, and personalized 4D scene generation.
CVPR, 2024
3D Reconstruction with Generalizable Neural Fields using Scene Priors
NFP learns scene priors for fast, flexible 3D reconstruction and single-image novel-view synthesis.
ICLR, 2024
HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data
HOIDiffusion generates realistic 3D hand-object interaction data for perception tasks such as 6D pose estimation.
CVPR, 2024
TUVF: Learning Generalizable Texture UV Radiance Fields
The paper introduces TUVF, a method for learning generalizable texture UV radiance fields.
arXiv, 2023
Affordance diffusion: Synthesizing hand-object interactions
The paper proposes a diffusion-based method for synthesizing hand-object interactions. It builds on the classic idea of disentangling where to interact (layout) from how to interact (content).
CVPR, 2023
Open-vocabulary panoptic segmentation with text-to-image diffusion models
We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation, which unifies pre-trained text-image diffusion and discriminative models to perform open-vocabulary panoptic segmentation.
CVPR, 2023
CoordGAN: Self-Supervised Dense Correspondences Emerge from GANs
This work introduces Coordinate GAN (CoordGAN), a structure-texture disentangled GAN that learns a dense correspondence map for each generated image.
CVPR, 2022
GroupViT: Semantic Segmentation Emerges from Text Supervision
This paper proposes a hierarchical Grouping Vision Transformer (GroupViT), which learns to group image regions into progressively larger arbitrary-shaped segments.
CVPR, 2022
Autoregressive 3D Shape Generation via Canonical Mapping
The paper demonstrates a solution for 3D point cloud generation using transformers. The key idea is to decompose a point cloud into a sequence of semantically meaningful shape compositions, which are further encoded by an autoregressive model for point cloud generation.
ECCV, 2022
Learning Continuous Image Representation with Local Implicit Image Function
The paper presents a method for learning continuous image representations with a local implicit image function.
CVPR, 2021
Video Autoencoder: self-supervised disentanglement of static 3D structure and motion
This paper presents a video autoencoder for learning disentangled representations of 3D structure and camera pose from videos in a self-supervised manner.
ICCV, 2021
Learning 3D Dense Correspondence via Canonical Point Autoencoder
The paper presents a method for learning 3D dense correspondence using a canonical point autoencoder.
NeurIPS, 2021
ICLR, 2021
Joint-task self-supervised learning for temporal correspondence
This paper proposes to learn reliable dense correspondence from videos in a self-supervised manner.
NeurIPS, 2019
BMVC, 2019
CVPR, 2019
ECCV, 2018
NeurIPS, 2017
BMVC, 2017
CVPR workshop, 2017
CVPR, 2017
ICIP, 2014
CVPR, 2013