VQ3D is a two-stage autoencoder based on ViT-VQGAN. We introduce a novel 3D-aware NeRF-based decoder, trained with depth losses and adversarial supervision on both primary and novel views to encourage the learning of 3D-aware representations.
The resulting model offers multiple capabilities. Stage 1 employs an encoder-decoder architecture that can encode unseen, unposed RGB images and generate a NeRF with consistent depth and high-quality novel views. Stage 2 functions as a fully generative model for 3D-aware images. For technical details, please consult our paper. Thank you!
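To make the two-stage structure concrete, here is a minimal, self-contained toy in PyTorch. All module names, sizes, and layers below are illustrative stand-ins, not the actual VQ3D architecture; they only sketch how an encoder, a vector quantizer, and a NeRF-style decoder fit together in Stage 1, and where Stage 2 would plug in.

# Toy sketch of the two-stage structure; not the VQ3D implementation.
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Stands in for the ViT-VQGAN encoder: image -> a grid of latent tokens."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # 16x16 "patches"

    def forward(self, rgb):                                # rgb: (B, 3, 256, 256)
        return self.net(rgb).flatten(2).transpose(1, 2)    # (B, 256 tokens, dim)

class ToyQuantizer(nn.Module):
    """Nearest-neighbour vector quantization against a learned codebook."""
    def __init__(self, codebook_size=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):                                  # z: (B, N, dim)
        dist = torch.cdist(z, self.codebook.weight[None])  # distances to all codes
        idx = dist.argmin(dim=-1)                          # discrete token ids
        return self.codebook(idx), idx

class ToyNeRFDecoder(nn.Module):
    """Stands in for the 3D-aware decoder: tokens condition a radiance field."""
    def __init__(self, dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim + 3, 128), nn.ReLU(),
                                 nn.Linear(128, 4))        # -> (density, r, g, b)

    def forward(self, tokens, points):                     # points: (B, P, 3) 3D samples
        cond = tokens.mean(dim=1, keepdim=True).expand(-1, points.shape[1], -1)
        out = self.mlp(torch.cat([cond, points], dim=-1))
        return out[..., :1].relu(), out[..., 1:].sigmoid() # densities, colors

# Stage 1: a single unposed RGB image is encoded, quantized, and decoded into
# a conditional radiance field that can be queried at arbitrary 3D points.
enc, vq, dec = ToyEncoder(), ToyQuantizer(), ToyNeRFDecoder()
image = torch.rand(1, 3, 256, 256)
tokens, token_ids = vq(enc(image))
sigma, rgb = dec(tokens, torch.rand(1, 1024, 3))
print(sigma.shape, rgb.shape, token_ids.shape)

# Stage 2 (not shown) would train a generative model over the discrete
# token_ids and feed its samples to the same decoder.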
Below are generated 3D images accompanied by normalized disparity. Our method trains on and generates examples from the full ImageNet dataset, which comprises 1.2 million training images spanning 1000 classes.
Here, we present results that manipulate real images to generate novel views. Our model achieves high-quality reconstruction at the canonical viewpoint along with plausible novel views.
Stage 1 can convert a single RGB input image into a NeRF and then render novel views, optionally with additional objects inserted into the scene. The model requires only a single unposed RGB image (unseen during training) to create a NeRF in one forward pass. The resulting scene exhibits plausible geometry, occlusion, and disocclusion.
Original Image
Rendered Novel Views with Additional Object
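For intuition about how a NeRF is turned into the images and disparity maps shown on this page, below is a minimal NeRF-style volume rendering routine. It is a generic sketch, not the VQ3D renderer: per-sample densities and colors along each camera ray are composited into a pixel color, and the same rendering weights give an accumulated disparity.

# Toy NeRF-style volume rendering along a batch of rays; illustrative only.
import torch

def volume_render(sigma, rgb, t):
    """sigma: (R, S) densities, rgb: (R, S, 3) colors, t: (R, S) sample depths."""
    delta = torch.cat([t[:, 1:] - t[:, :-1],
                       torch.full_like(t[:, :1], 1e10)], dim=-1)      # sample spacing
    alpha = 1.0 - torch.exp(-sigma * delta)                           # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones_like(alpha[:, :1]),
                                     1.0 - alpha + 1e-10], dim=-1), dim=-1)[:, :-1]
    weights = alpha * trans                                           # volumetric rendering weights
    color = (weights[..., None] * rgb).sum(dim=1)                     # rendered pixel color
    disparity = (weights / t.clamp(min=1e-6)).sum(dim=1)              # accumulated (expected) disparity
    return color, disparity, weights

rays, samples = 4, 64
t = torch.linspace(2.0, 6.0, samples).expand(rays, samples)
color, disp, weights = volume_render(torch.rand(rays, samples),
                                     torch.rand(rays, samples, 3), t)
print(color.shape, disp.shape, weights.shape)                         # (4, 3), (4,), (4, 64)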
Our model also demonstrates strong performance on the CompCars dataset.
Stage 1 computes two NeRFs from two RGB images and interpolates between them. Note that while vector-quantized models do not inherently guarantee high-quality interpolation, semantically aware interpolation remains a promising area for future research.
Input Start Image
Input End Image
Interpolation
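The caveat about vector-quantized interpolation can be made concrete with a small toy: linearly interpolating two continuous latents and re-quantizing them against a codebook yields a path of discrete codes that can switch abruptly, which is why smooth interpolation is not guaranteed. The codebook and latents below are random stand-ins, not VQ3D's actual codes.

# Toy illustration of interpolating vector-quantized latents.
import torch

torch.manual_seed(0)
codebook = torch.randn(512, 64)        # 512 codes of dimension 64 (random stand-in)
z_start = torch.randn(256, 64)         # latent tokens of the start image
z_end = torch.randn(256, 64)           # latent tokens of the end image
idx_start = torch.cdist(z_start, codebook).argmin(dim=-1)

for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    z = torch.lerp(z_start, z_end, alpha)             # continuous interpolation
    idx = torch.cdist(z, codebook).argmin(dim=-1)     # re-quantize to nearest code
    z_q = codebook[idx]                               # quantized latent fed to the decoder
    changed = (idx != idx_start).float().mean().item()
    print(f"alpha={alpha:.2f}: {changed:.0%} of tokens switched code")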
Our depth loss constrains the NeRF disparity to closely match the pseudo-ground-truth disparity on a pointwise basis. This supervision acts on the pointwise volumetric rendering weights rather than on the accumulated disparity. The images below illustrate the effect of training with a depth loss on the accumulated disparity (left) versus our pointwise depth loss (right).
Trained with Depth Loss on Accumulated Disparity
Trained with Our Pointwise Depth Loss
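A toy contrast between the two forms of supervision is sketched below. The first loss only matches a ray's accumulated (expected) disparity, so the rendering weights may remain spread out along the ray; the second, in the spirit of our pointwise loss, penalizes weight placed at samples whose disparity is far from the pseudo-ground truth, encouraging the weights to concentrate near the correct surface. These are simplified forms for illustration; the exact losses used in VQ3D are given in the paper.

# Toy contrast between supervising accumulated disparity and supervising the
# per-sample rendering weights. Simplified, illustrative forms only.
import torch

def accumulated_disparity_loss(weights, disp_samples, disp_gt):
    """Match only the ray's accumulated (expected) disparity to the pseudo-GT."""
    disp_rendered = (weights * disp_samples).sum(dim=-1)
    return (disp_rendered - disp_gt).abs().mean()

def pointwise_depth_loss(weights, disp_samples, disp_gt, tol=0.05):
    """Penalize rendering weight at samples whose disparity is far from the
    pseudo-GT, so the weights concentrate near the correct surface."""
    far = (disp_samples - disp_gt[..., None]).abs() > tol   # off-surface samples
    return (weights * far).sum(dim=-1).mean()

rays, samples = 4, 64
t = torch.linspace(2.0, 6.0, samples).expand(rays, samples)   # sample depths along each ray
disp_samples = 1.0 / t                                        # per-sample disparity
weights = torch.softmax(torch.randn(rays, samples), dim=-1)   # toy rendering weights
disp_gt = torch.full((rays,), 0.3)                            # pseudo-GT disparity per ray

print(accumulated_disparity_loss(weights, disp_samples, disp_gt))
print(pointwise_depth_loss(weights, disp_samples, disp_gt))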
@InProceedings{vq3d,
author = {
Sargent, Kyle and Koh, Jing Yu and Zhang, Han and Chang, Huiwen
and Herrmann, Charles and Srinivasan, Pratul and Wu, Jiajun and Sun, Deqing
},
title = {{VQ3D}: Learning a {3D}-Aware Generative Model on {ImageNet}},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
year = {2023}
}