Kyle Sargent

I am a PhD student at Stanford (since fall 2022), working in the Stanford AI Lab. I am advised by Jiajun Wu and Fei-Fei Li. I have also worked at Google Research as a student researcher, mentored by Deqing Sun and Charles Herrmann.

Previously, I was an AI resident at Google Research. Before joining Google, I received my undergraduate degree from Harvard, where I studied CS and math.

Email  /  GitHub  /  Google Scholar  /  LinkedIn  /  Misc


Research

I work in computer vision. My main areas of focus are 3D reconstruction, novel view synthesis, and 3D generative models.

Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis



Basile Van Hoorick, Rundi Wu, Ege Ozguroglu, Kyle Sargent, Ruoshi Liu, Pavel Tokmakov, Achal Dave, Changxi Zheng, Carl Vondrick
ECCV, 2024.
arxiv / project page

We finetune a video diffusion model to synthesize large-angle novel viewpoints of dynamic scenes from a single monocular video. Our framework predicts RGB novel views of dynamic scenes, and we additionally extend it to semantic segmentation, with applications to driving scenes.

ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Real Image



Kyle Sargent, Zizhang Li, Tanmay Shah, Charles Herrmann, Hong-Xing “Koven” Yu, Yunzhi Zhang, Eric Ryan Chan, Dmitry Lagun, Li Fei-Fei, Deqing Sun, Jiajun Wu
CVPR, 2024.
arxiv / project page

We train a 3D-aware diffusion model, ZeroNVS, on a mixture of scene data sources capturing object-centric, indoor, and outdoor scenes. This enables zero-shot SDS distillation of 360-degree NeRF scenes from a single image. Our model sets a new state-of-the-art LPIPS result on the DTU dataset in the zero-shot setting. We also use the MipNeRF-360 dataset as a benchmark for single-image NVS.

WonderJourney: Going from Anywhere to Everywhere



Hong-Xing “Koven” Yu, Haoyi Duan, Junhwa Hur, Kyle Sargent, Michael Rubinstein, William T. Freeman, Forrester Cole, Deqing Sun, Noah Snavely, Jiajun Wu, Charles Herrmann
CVPR, 2024.
arxiv / project page

We introduce WonderJourney, a modularized framework for perpetual scene generation. We start at any user-provided location (specified by a text description or an image) and generate a journey through a long sequence of diverse yet coherently connected 3D scenes. We leverage an LLM to generate textual descriptions of the scenes in the journey, a text-driven point cloud generation pipeline to produce a compelling and coherent sequence of 3D scenes, and a large VLM to verify the generated scenes. We show compelling, diverse visual results across various scene types and styles, forming imaginary wonderjourneys.


NAVI: Category-Agnostic Image Collections with High-Quality 3D Shape and Pose Annotations



Varun Jampani*, Kevis-Kokitsi Maninis*, Andreas Engelhardt, Arjun Karpur, Karen Truong, Kyle Sargent, Stefan Popov, Andre Araujo, Ricardo Martin-Brualla, Kaushal Patel, Daniel Vlasic, Vittorio Ferrari, Ameesh Makadia, Ce Liu, Yuanzhen Li, Howard Zhou
NeurIPS, 2023.
arxiv / project page

We propose “NAVI”: a new dataset of category-agnostic image collections of objects with high-quality 3D scans, along with per-image 2D-3D alignments that provide near-perfect ground-truth camera parameters.


VQ3D: Learning a 3D Generative Model on ImageNet



Kyle Sargent, Jing Yu Koh, Han Zhang, Huiwen Chang, Charles Herrmann, Pratul Srinivasan, Jiajun Wu, Deqing Sun
ICCV, 2023. Oral presentation. Best paper finalist.
arxiv / project page

VQ3D introduces a 3D-aware NeRF-based decoder into the two-stage ViT-VQGAN. Stage 1 then enables novel view synthesis from input images, and Stage 2 enables generation of entirely new 3D-aware images. We achieve an ImageNet FID of 16.8, compared to 69.8 for the best baseline.


Self-supervised AutoFlow



Hsin-Ping Huang, Charles Herrmann, Junhwa Hur, Erika Lu, Kyle Sargent, Austin Stone, Ming-Hsuan Yang, Deqing Sun
CVPR, 2023.
arxiv / project page

We introduce self-supervised AutoFlow to handle real-world videos without ground-truth labels, using a self-supervised loss as the search metric.


Pyramid Adversarial Training Improves ViT Performance



Kyle Sargent*, Charles Herrmann*, Lu Jiang, Ramin Zabih, Huiwen Chang, Ce Liu, Dilip Krishnan, Deqing Sun (*equal contribution)
CVPR, 2022. Oral presentation. Best paper finalist.
arxiv

Aggressive data augmentation is a key component of the strong generalization capabilities of Vision Transformers (ViTs). We propose “Pyramid Adversarial Training,” a strong adversarial augmentation that perturbs images at multiple scales during training. We achieve a new state of the art on ImageNet-C, ImageNet-Rendition, and ImageNet-Sketch using only our augmentation and the standard ViT-B/16 backbone.

SLIDE: Single Image 3D Photography with Soft Layering and Depth-aware Inpainting



Varun Jampani*, Huiwen Chang*, Kyle Sargent, Abhishek Kar, Richard Tucker, Michael Krainin, Dominik Kaeser, William T. Freeman, David Salesin, Brian Curless, Ce Liu (*equal contribution)
ICCV, 2021. Oral presentation.
arxiv / project page

We design a unified system for novel view synthesis that leverages soft layering and depth-aware inpainting to achieve state-of-the-art results on multiple view synthesis datasets. The soft layering incorporates matting, which preserves intricate details in the synthesized views.





Built with Leonid Keselman's Jekyll fork of Jon Barron's website.