VQ3D: Learning a 3D-Aware Generative Model on ImageNet

ICCV 2023, Oral presentation, Best paper finalist

Kyle Sargent, Jing Yu Koh, Han Zhang, Huiwen Chang, Charles Herrmann, Pratul Srinivasan, Jiajun Wu, Deqing Sun


VQ3D is a 2-stage autoencoder based on ViT-VQGAN. We use a novel 3D-aware NeRF-based decoder as well as depth losses and adversarial supervision on main and novel views to encourage learning of 3D-aware representations. The resulting model has multiple capabilities. Stage 1 of our model is an encoder-decoder architecture which can encode unseen, unposed RGB images and produce a NeRF with consistent depth and high-quality novel views. Stage 2 is a fully generative model of 3D-aware images. For technical details, please consult our paper. Thanks!
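As background for the ViT-VQGAN basis mentioned above, the core of any vector-quantized autoencoder is a nearest-neighbor lookup that replaces each continuous encoder output with an entry from a learned codebook. The sketch below is a minimal illustration of that lookup only, not our implementation: the function name `quantize`, the toy codebook, and the plain squared Euclidean distance are all assumptions made for the example, and the actual model may normalize or project codes differently.

```python
def quantize(vectors, codebook):
    """Map each vector to the index and value of its nearest codebook entry."""
    indices = []
    for v in vectors:
        # Squared Euclidean distance from v to every codebook entry.
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        indices.append(dists.index(min(dists)))
    quantized = [codebook[i] for i in indices]
    return indices, quantized


# Hypothetical toy codebook with three 2-D entries.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# Hypothetical continuous encoder outputs for three tokens.
encoder_outputs = [[0.9, 0.1], [0.1, 0.8], [0.2, 0.1]]

indices, quantized = quantize(encoder_outputs, codebook)
print(indices)  # [1, 2, 0]
```

In VQGAN-style pipelines, the resulting discrete indices are what a second-stage generative model is trained on, while the decoder consumes the quantized vectors.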


Stage 2: Example generated 3D images:

We show generated 3D images as well as normalized disparity below. Our method is capable of training on and then generating examples from the full ImageNet dataset of 1.2 million training images and 1000 classes.
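For readers unfamiliar with the term, disparity is the reciprocal of depth, and a normalized disparity map rescales it to a fixed range for display. The sketch below assumes a simple min-max normalization to [0, 1]; the helper name `normalized_disparity` and the `eps` guard are hypothetical, and the exact normalization used for the figures on this page may differ.

```python
def normalized_disparity(depths, eps=1e-8):
    """Convert depths to disparity (1 / depth) and rescale to [0, 1]."""
    # Guard against division by zero for degenerate depths.
    disparity = [1.0 / max(d, eps) for d in depths]
    lo, hi = min(disparity), max(disparity)
    # Guard against a constant map, where hi == lo.
    span = max(hi - lo, eps)
    return [(x - lo) / span for x in disparity]


# Hypothetical depths along one image row: nearer points get larger values.
depths = [1.0, 2.0, 4.0]
result = normalized_disparity(depths)
```

With this convention, the nearest point maps to 1 and the farthest to 0, so nearby surfaces appear brightest in the visualization.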

Stage 1: Example single-view 3D reconstructions with novel views:

We show results from manipulating real images to generate novel views. Our model achieves good reconstruction at the canonical viewpoint and plausible novel views.

Example object manipulations within a scene:

Our Stage 1 can convert a single RGB input image into a NeRF, and then render novel views while manipulating the scene, for example by adding objects. The only input to the model is a single unposed RGB image which is unseen during training. The NeRF is created without any optimization, in a single forward pass of our network. Still, the reconstructed scene has plausible geometry, occlusion, and disoccluded pixels.

Original Image

Rendered novel views with additional object


Example generated scenes from CompCars:

Our model also achieves good performance on CompCars.

Example interpolation between two images:

Our Stage 1 can compute two NeRFs from two RGB images, and then interpolate between them. Vector-quantized models do not guarantee high-quality interpolation, and achieving high-quality, semantically aware interpolation is a direction for future work.
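This page does not specify which representation is interpolated. As a rough illustration only, the sketch below assumes the two images are each encoded to a continuous latent feature vector and that intermediate frames come from linear interpolation between them; `lerp`, `interpolation_path`, and the toy latents are hypothetical names and values, not part of our released method.

```python
def lerp(start, end, t):
    """Linearly interpolate between two equal-length vectors at t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(start, end)]


def interpolation_path(start, end, steps):
    """Return `steps` evenly spaced vectors from start to end (steps >= 2)."""
    return [lerp(start, end, i / (steps - 1)) for i in range(steps)]


# Hypothetical 2-D latents for the start and end images.
path = interpolation_path([0.0, 2.0], [1.0, 0.0], 3)
print(path)  # [[0.0, 2.0], [0.5, 1.0], [1.0, 0.0]]
```

Each intermediate vector would then be decoded to a NeRF and rendered. Because quantization snaps latents to discrete codebook entries, nearby points on such a path need not decode to smoothly varying scenes, which is one reason interpolation quality is not guaranteed.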

Input start image

Input end image



Importance of pointwise depth loss

Our depth loss constrains the NeRF disparity to lie close to the pseudo-GT disparity in a pointwise fashion. It supervises the pointwise volumetric rendering weights rather than the accumulated disparity. Below we see the effect of training with a depth loss on the accumulated disparity (left) versus our depth loss on the pointwise volumetric rendering weights (right).
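The difference between the two kinds of supervision can be sketched numerically. The code below is an illustration under stated assumptions, not the exact loss from the paper: it uses the standard NeRF rendering weights, a squared error on the accumulated disparity, and a weight-averaged squared error as a stand-in for a pointwise loss. The function names and the toy ray are hypothetical.

```python
import math


def rendering_weights(densities, deltas):
    """Standard NeRF weights: w_i = T_i * (1 - exp(-sigma_i * delta_i))."""
    weights, transmittance = [], 1.0
    for sigma, delta in zip(densities, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weights.append(transmittance * alpha)
        # Light remaining after passing through this sample.
        transmittance *= 1.0 - alpha
    return weights


def accumulated_loss(weights, disparities, target):
    """Squared error between the rendered (accumulated) disparity and the target."""
    rendered = sum(w * d for w, d in zip(weights, disparities))
    return (rendered - target) ** 2


def pointwise_loss(weights, disparities, target):
    """Weight-averaged squared error of each sample's disparity to the target."""
    return sum(w * (d - target) ** 2 for w, d in zip(weights, disparities))


# Hypothetical ray: half the weight in front of the target, half behind it.
densities = [math.log(2.0), 0.0, 1e9]
deltas = [1.0, 1.0, 1.0]
disparities = [0.2, 0.5, 0.8]
target = 0.5

w = rendering_weights(densities, deltas)
acc = accumulated_loss(w, disparities, target)
pw = pointwise_loss(w, disparities, target)
```

In this toy ray the weights are split between two surfaces whose disparities average to the target, so the accumulated loss is approximately zero even though no density sits at the target. The pointwise variant stays positive (about 0.09) until the weight concentrates near the target disparity.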

Trained with depth loss on accumulated depth

Trained with our pointwise depth loss



    author = {
       Sargent, Kyle and Koh, Jing Yu and Zhang, Han and Chang, Huiwen
       and Herrmann, Charles and Srinivasan, Pratul and Wu, Jiajun and Sun, Deqing},
    title = {{VQ3D}: Learning a {3D}-Aware Generative Model on {ImageNet}},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    year = {2023}