I am a PhD student in the Stanford AI Lab, advised by Jiajun Wu and Fei-Fei Li. I began my PhD at Stanford in fall 2022.
I have also worked at Google Research as a student researcher, mentored by Deqing Sun and Charles Herrmann.
Previously, I was an AI resident at Google Research. Before joining Google, I received my undergraduate degree from Harvard, where I studied CS and math.
We empirically analyze view synthesis models as data augmentation for learning viewpoint-invariant policies from single-viewpoint demonstration data. On out-of-distribution camera viewpoints, our method outperforms baselines in both simulated and real-world manipulation tasks.
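A minimal sketch of how such augmentation might look in code, assuming a hypothetical render_novel_view callable that stands in for the view synthesis model (an illustration, not our released implementation):

import random

def sample_camera_perturbation(max_angle_deg=30.0):
    # Sample a random azimuth/elevation offset for the virtual camera.
    return (random.uniform(-max_angle_deg, max_angle_deg),
            random.uniform(-max_angle_deg, max_angle_deg))

def augment_demonstrations(demos, render_novel_view, num_views=4):
    # demos: list of (image, action) pairs from one fixed camera.
    # render_novel_view(image, azimuth, elevation) -> synthesized image.
    augmented = []
    for image, action in demos:
        augmented.append((image, action))  # keep the original view
        for _ in range(num_views):
            az, el = sample_camera_perturbation()
            novel = render_novel_view(image, az, el)
            # Reuse the same action label: the policy is trained to be
            # invariant to the camera viewpoint.
            augmented.append((novel, action))
    return augmented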
Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis
Basile Van Hoorick, Rundi Wu, Ege Ozguroglu, Kyle Sargent, Ruoshi Liu, Pavel Tokmakov, Achal Dave, Changxi Zheng, Carl Vondrick
ECCV, 2024.
arxiv /
project page /
We fine-tune a video diffusion model to synthesize large-angle novel viewpoints of dynamic scenes from a single monocular video. Our framework predicts RGB novel views of dynamic scenes, and we additionally extend it to applications in semantic segmentation for driving scenes.
ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Real Image
Kyle Sargent, Zizhang Li, Tanmay Shah, Charles Herrmann, Hong-Xing “Koven” Yu, Yunzhi Zhang, Eric Ryan Chan, Dmitry Lagun, Li Fei-Fei, Deqing Sun, Jiajun Wu
CVPR, 2024.
arxiv /
project page /
We train ZeroNVS, a 3D-aware diffusion model, on a mixture of scene data sources that capture object-centric, indoor, and outdoor scenes. This enables zero-shot SDS distillation of 360-degree NeRF scenes from a single image. Our model sets a new state-of-the-art LPIPS result on the DTU dataset in the zero-shot setting. We also establish the Mip-NeRF 360 dataset as a benchmark for single-image novel view synthesis.
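For readers unfamiliar with SDS, here is a minimal PyTorch sketch of one distillation step, assuming a generic noise-predicting diffusion_model and a common timestep weighting (a simplification, not ZeroNVS's exact code):

import torch

def sds_step(diffusion_model, rendered, cond, alphas_cumprod):
    # rendered: (B, C, H, W) NeRF rendering with requires_grad=True.
    # cond: conditioning, e.g. the input image and relative camera pose.
    B = rendered.shape[0]
    T = alphas_cumprod.shape[0]
    t = torch.randint(0, T, (B,), device=rendered.device)
    a = alphas_cumprod[t].view(B, 1, 1, 1)
    noise = torch.randn_like(rendered)
    noisy = a.sqrt() * rendered + (1 - a).sqrt() * noise
    with torch.no_grad():
        eps_pred = diffusion_model(noisy, t, cond)  # predicted noise
    w = 1 - a  # one common choice of timestep weighting
    grad = w * (eps_pred - noise)
    # Inject the SDS gradient into the NeRF parameters through `rendered`.
    rendered.backward(gradient=grad)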
WonderJourney: Going from Anywhere to Everywhere
Hong-Xing “Koven” Yu, Haoyi Duan, Junhwa Hur, Kyle Sargent, Michael Rubinstein, William T. Freeman, Forrester Cole, Deqing Sun, Noah Snavely, Jiajun Wu, Charles Herrmann
CVPR, 2024.
arxiv /
project page /
We introduce WonderJourney, a modularized framework for perpetual scene generation. We start at any user-provided location (specified by a text description or an image) and generate a journey through a long sequence of diverse yet coherently connected 3D scenes. We leverage an LLM to generate textual descriptions of the scenes in this journey, a text-driven point cloud generation pipeline to produce a compelling and coherent sequence of 3D scenes, and a large VLM to verify the generated scenes. We show compelling, diverse visual results across various scene types and styles, forming imaginary "wonderjourneys."
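A high-level sketch of this loop, where describe_next_scene, generate_scene, and verify_scene are hypothetical stand-ins for the LLM, the point cloud pipeline, and the VLM verifier:

def wonder_journey(start, describe_next_scene, generate_scene, verify_scene,
                   num_scenes=10, max_retries=3):
    # start: a user-provided image or text description.
    scenes = [start]
    for _ in range(num_scenes):
        # 1. The LLM proposes a textual description of the next scene.
        description = describe_next_scene(scenes[-1])
        # 2. The text-driven point cloud pipeline builds a 3D scene that
        #    connects coherently to the previous one; 3. the VLM checks
        #    the result and triggers regeneration on failure.
        for _ in range(max_retries):
            scene = generate_scene(description, previous=scenes[-1])
            if verify_scene(scene, description):
                break
        scenes.append(scene)
    return scenes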
NAVI: Category-Agnostic Image Collections with High-Quality 3D Shape and Pose Annotations
Varun Jampani*, Kevis-Kokitsi Maninis*, Andreas Engelhardt, Arjun Karpur, Karen Truong, Kyle Sargent, Stefan Popov, Andre Araujo, Ricardo Martin-Brualla, Kaushal Patel, Daniel Vlasic, Vittorio Ferrari, Ameesh Makadia, Ce Liu, Yuanzhen Li, Howard Zhou
NeurIPS, 2023.
arxiv /
project page /
We propose “NAVI”: a new dataset of category-agnostic image collections of objects with high-quality 3D scans, along with per-image 2D-3D alignments that provide near-perfect ground-truth camera parameters.
VQ3D: Learning a 3D Generative Model on ImageNet
Kyle Sargent, Jing Yu Koh, Han Zhang, Huiwen Chang, Charles Herrmann, Pratul Srinivasan, Jiajun Wu, Deqing Sun
ICCV, 2023. Oral presentation. Best paper finalist.
arxiv /
project page /
VQ3D introduces a 3D-aware NeRF-based decoder to the two-stage ViT-VQGAN. Stage 1 then enables novel view synthesis from input images, and Stage 2 enables generation of entirely new 3D-aware images. We achieve an ImageNet FID of 16.8, compared to 69.8 for the best baseline.
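Schematically, the two stages can be used as follows (component names are placeholders, not the released API):

def stage1_novel_view_synthesis(encoder, nerf_decoder, image, camera):
    # Stage 1: encode an image to codes, then render from a new camera;
    # the NeRF-based decoder is conditioned on the camera pose.
    codes = encoder(image)
    return nerf_decoder(codes, camera)

def stage2_generate_3d_aware_image(transformer, nerf_decoder, camera):
    # Stage 2: sample a brand-new code sequence, then render it from
    # any desired camera to obtain a 3D-aware generated image.
    codes = transformer.sample()
    return nerf_decoder(codes, camera)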
We introduce self-supervised AutoFlow to handle real-world videos without ground-truth labels, using a self-supervised loss as the search metric.
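As an illustration of the kind of self-supervised signal involved, here is a standard photometric warping loss for optical flow in PyTorch (a simplified stand-in, not the paper's exact objective):

import torch
import torch.nn.functional as F

def photometric_loss(frame1, frame2, flow):
    # frame1, frame2: (B, C, H, W) consecutive frames in [0, 1].
    # flow: (B, 2, H, W) forward flow from frame1 to frame2, in pixels.
    B, _, H, W = frame1.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).to(flow)   # (2, H, W), (x, y) order
    target = grid.unsqueeze(0) + flow              # where each pixel lands
    # Normalize coordinates to [-1, 1] as expected by grid_sample.
    tx = 2 * target[:, 0] / (W - 1) - 1
    ty = 2 * target[:, 1] / (H - 1) - 1
    warped = F.grid_sample(frame2, torch.stack((tx, ty), dim=-1),
                           align_corners=True)
    # If the flow is correct, the warped second frame matches the first.
    return (warped - frame1).abs().mean()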
Pyramid Adversarial Training Improves ViT Performance
Kyle Sargent*, Charles Herrmann*, Lu Jiang, Ramin Zabih, Huiwen Chang, Ce Liu, Dilip Krishnan, Deqing Sun (*equal contribution)
CVPR, 2022. Oral presentation. Best paper finalist.
arxiv /
Aggressive data augmentation is a key component of the strong generalization capabilities of Vision Transformers (ViT). We propose “Pyramid Adversarial Training,” a strong adversarial augmentation that perturbs images at multiple scales during training. Using only our augmentation and the standard ViT-B/16 backbone, we achieve a new state of the art on ImageNet-C, ImageNet-Rendition, and ImageNet-Sketch.
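A minimal sketch of the multi-scale perturbation, with illustrative scales and step sizes rather than the paper's exact hyperparameters:

import torch
import torch.nn.functional as F

def pyramid_adversarial_example(model, images, labels,
                                scales=(1, 4, 16), steps=5, step_size=0.01):
    # Keep one learnable perturbation per pyramid level; coarse levels are
    # upsampled to full resolution, so the attack spans per-pixel noise
    # through large structured distortions.
    B, C, H, W = images.shape
    deltas = [torch.zeros(B, C, H // s, W // s, device=images.device,
                          requires_grad=True) for s in scales]
    def total_perturbation():
        return sum(F.interpolate(d, size=(H, W), mode="bilinear",
                                 align_corners=False) for d in deltas)
    for _ in range(steps):
        loss = F.cross_entropy(model(images + total_perturbation()), labels)
        grads = torch.autograd.grad(loss, deltas)
        with torch.no_grad():
            for d, g in zip(deltas, grads):
                d += step_size * g.sign()  # gradient ascent on the loss
    return (images + total_perturbation()).clamp(0, 1).detach()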
SLIDE: Single Image 3D Photography with Soft Layering and Depth-aware Inpainting
Varun Jampani*, Huiwen Chang*, Kyle Sargent, Abhishek Kar, Richard Tucker, Michael Krainin, Dominik Kaeser, William T. Freeman, David Salesin, Brian Curless, Ce Liu (*equal contribution)
ICCV, 2021. Oral presentation.
arxiv /
project page /
We design a unified system for novel view synthesis that leverages soft layering and depth-aware inpainting to achieve state-of-the-art results on multiple view synthesis datasets. The soft layering incorporates matting, which preserves intricate details in the synthesized views.
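The core compositing step behind soft layering is standard alpha matting; a short sketch with illustrative names:

def composite_soft_layers(foreground, inpainted_background, alpha):
    # alpha: (B, 1, H, W) soft matte in [0, 1]; fractional values let
    # intricate boundary details (e.g. hair) survive in the novel view.
    return alpha * foreground + (1 - alpha) * inpainted_background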