VQ3D is a two-stage autoencoder based on ViT-VQGAN. We use a novel 3D-aware NeRF-based decoder, together with depth losses and adversarial supervision on both main and novel views, to encourage the model to learn 3D-aware representations. The resulting model has two complementary capabilities. Stage 1 is an encoder-decoder architecture that can encode unseen, unposed RGB images and produce a NeRF with consistent depth and high-quality novel views. Stage 2 is a fully generative model of 3D-aware images. For technical details, please consult our paper. Thanks!
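To make the two-stage structure concrete, below is a minimal PyTorch sketch of the stage-1 encode-quantize-decode path. All module and variable names are hypothetical placeholders: the real model uses ViT-VQGAN components and a NeRF-based decoder with volume rendering, which we only gesture at in comments.

```python
import torch
import torch.nn as nn

class Stage1Sketch(nn.Module):
    """Skeleton of the stage-1 autoencoder: encode image features to a
    discrete token grid, then condition a 3D decoder on those tokens.
    Names and layers here are illustrative stand-ins, not the paper's
    actual implementation."""

    def __init__(self, dim=256, codebook_size=8192):
        super().__init__()
        self.encoder = nn.Linear(dim, dim)    # stand-in for a ViT encoder
        self.codebook = nn.Embedding(codebook_size, dim)
        self.decoder = nn.Linear(dim, dim)    # stand-in for the NeRF decoder

    def forward(self, patch_feats):           # [batch, tokens, dim]
        z = self.encoder(patch_feats)
        # Vector quantization: snap each token to its nearest codebook entry.
        dists = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        indices = dists.argmin(dim=-1)        # [batch, tokens]
        z_q = self.codebook(indices)
        # In the full model, z_q conditions a NeRF that is volume-rendered
        # from both the main camera and a sampled novel camera; both renders
        # receive reconstruction, depth, and adversarial losses.
        cond = self.decoder(z_q)
        return cond, indices
```

As in ViT-VQGAN, stage 2 can then be realized as a generative prior (e.g., an autoregressive transformer) over the discrete `indices`, so that new token grids, and hence new 3D-aware images, can be sampled from scratch.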
[Figure] Object editing: original image alongside rendered novel views with an additional object.
[Figure] Latent interpolation: input start image, input end image, and the interpolated sequence.
[Figure] Depth supervision comparison: trained with a depth loss on accumulated depth vs. trained with our pointwise depth loss.
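The last comparison contrasts two ways of using pseudo-ground-truth depth (e.g., from a monocular depth estimator). A loss on accumulated depth only constrains the expected depth along each ray, so density can spread out as long as its mean lands in the right place; a pointwise loss penalizes every sample by its distance to the target depth, pushing density to concentrate at the surface. Here is a minimal PyTorch sketch of that distinction, with hypothetical tensor names and illustrative expressions rather than the paper's exact formulation:

```python
import torch

def accumulated_depth_loss(weights, t_vals, depth_gt):
    """Supervise only the expected (accumulated) depth of each ray.

    weights:  [num_rays, num_samples] volume-rendering weights w_i
    t_vals:   [num_rays, num_samples] sample distances t_i along each ray
    depth_gt: [num_rays] pseudo-ground-truth depth per ray
    """
    expected_depth = (weights * t_vals).sum(dim=-1)  # D = sum_i w_i * t_i
    return (expected_depth - depth_gt).abs().mean()

def pointwise_depth_loss(weights, t_vals, depth_gt):
    """Penalize each sample's weight by its distance to the target depth,
    so density must concentrate near the surface instead of merely
    averaging out to the correct expected depth."""
    per_sample = weights * (t_vals - depth_gt.unsqueeze(-1)).abs()
    return per_sample.sum(dim=-1).mean()
```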