6Img-to-3D Driving Demo

6Img-to-3D Zero-Shot on NuScenes

Abstract

We propose 6Img-to-3D, a transformer-based encoder-renderer method. 6Img-to-3D generates 3D-consistent multi-view images for large-scale unbounded outdoor driving scenarios from only six input images. Existing approaches require detailed pose information and cannot reconstruct occluded regions faithfully. We take a step towards resolving these shortcomings by combining cross- and self-attention mechanisms for triplane parameterization with differentiable volume rendering. We showcase that six surround-view vehicle cameras without absolute pose information are enough to reconstruct 360° scenes during inference time.

Method

Pixel-aligned image features are extracted individually from each of the six outward-looking images using a pre-trained ResNet-101 followed by a Feature Pyramid Network (FPN). The features are then projected into a triplane using deformable attention. The renderer is a shallow MLP that takes the triplane's latent 3D features concatenated with the corresponding 2D image features, which are obtained by projecting the 3D coordinates into 2D image space. It outputs a color and an opacity for each 3D sample point, and these are composited into 2D images using volumetric rendering. The loss is computed on the rendered images and backpropagated through the entire network, which is trained end-to-end.
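The volumetric rendering step can be illustrated with a minimal sketch. It assumes the standard NeRF-style quadrature, in which opacity is parameterized as a non-negative density per sample; the exact formulation used by 6Img-to-3D is not stated here, and the function name `composite_ray` and its arguments are hypothetical.

```python
import math


def composite_ray(densities, colors, deltas):
    """Alpha-composite samples along one ray, front to back.

    Assumption: NeRF-style quadrature, not necessarily the authors' exact code.
    densities: non-negative density per sample
    colors:    (r, g, b) tuple per sample
    deltas:    length of the ray segment covered by each sample
    """
    transmittance = 1.0  # fraction of light that has not been absorbed yet
    pixel = [0.0, 0.0, 0.0]
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha          # contribution to the pixel
        for k in range(3):
            pixel[k] += weight * rgb[k]
        transmittance *= 1.0 - alpha
    return pixel, transmittance


# Empty space first, then a nearly opaque red surface that hides a green one.
pixel, remaining = composite_ray(
    densities=[0.0, 50.0, 50.0],
    colors=[(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    deltas=[0.5, 0.5, 0.5],
)
```

In this example the pixel comes out almost pure red: the first sample has zero density and contributes nothing, and the red surface absorbs nearly all of the light before the green sample is reached. Every operation is differentiable, which is what allows the image loss to be backpropagated to the renderer and encoder.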

To further improve image quality and speed up inference, the images are rendered at a lower resolution and then upsampled using SwinFIR.
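The image-feature lookup used by the renderer relies on projecting 3D sample points into each camera image. The sketch below shows that projection for a simple pinhole camera. It assumes the point has already been transformed into the camera frame (x right, y down, z forward) and ignores lens distortion; the function name `project_point` and the intrinsics in the example are hypothetical values chosen for illustration.

```python
def project_point(point_cam, fx, fy, cx, cy, width, height):
    """Project a 3D point in camera coordinates to pixel coordinates.

    Assumption: ideal pinhole model with focal lengths (fx, fy) and
    principal point (cx, cy), all in pixels.
    Returns (u, v), or None if the point is behind the camera or
    falls outside the image.
    """
    x, y, z = point_cam
    if z <= 0.0:
        return None  # behind the camera, no valid projection
    u = fx * x / z + cx
    v = fy * y / z + cy
    if not (0.0 <= u < width and 0.0 <= v < height):
        return None  # outside the field of view of this camera
    return (u, v)


# Hypothetical intrinsics for a 100 x 80 pixel image.
inside = project_point((1.0, 0.5, 4.0), 100.0, 100.0, 50.0, 40.0, 100, 80)
behind = project_point((0.0, 0.0, -1.0), 100.0, 100.0, 50.0, 40.0, 100, 80)
outside = project_point((10.0, 0.0, 1.0), 100.0, 100.0, 50.0, 40.0, 100, 80)
```

With six surround-view cameras, a given 3D point typically projects into only one or two images. Returning `None` for the others gives the caller a simple way to skip cameras that do not see the point.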

Results

Method                    PSNR     SSIM    LPIPS
PixelNeRF                 15.263   0.695   0.673
w/o Scene Contraction     17.493   0.726   0.479
w/o LPIPS Loss            18.953   0.736   0.538
w/o Image Feature Proj.   18.440   0.726   0.488
w/ SwinFIR Upscaler       19.188   0.746   0.444
6Img-to-3D (Ours)         18.864   0.733   0.453
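The ablation without scene contraction refers to mapping unbounded scene coordinates into a bounded region, so that distant geometry still fits inside a finite representation. The table does not say which contraction function is used; the sketch below assumes the widely used Mip-NeRF 360 form, and the function name `contract` is hypothetical.

```python
import math


def contract(point):
    """Contract an unbounded 3D point into a ball of radius 2.

    Assumption: the Mip-NeRF 360 contraction. Points inside the unit
    ball are left unchanged; points outside are squashed so that
    infinitely distant points approach radius 2.
    """
    norm = math.sqrt(sum(c * c for c in point))
    if norm <= 1.0:
        return tuple(point)
    scale = (2.0 - 1.0 / norm) / norm
    return tuple(c * scale for c in point)


near = contract((0.5, 0.0, 0.0))  # inside the unit ball: unchanged
mid = contract((4.0, 0.0, 0.0))   # radius 4 maps to 2 - 1/4 = 1.75
far = contract((1e6, 0.0, 0.0))   # very distant: radius just below 2
```

Under this assumption, nearby geometry keeps its full resolution while the far field is compressed into the shell between radius 1 and 2.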

Inward-Outward Dataset

No suitable dataset exists for reconstructing scenes from outward-looking cameras mounted on a vehicle. We fill this gap by introducing our own synthetic Inward-Outward dataset, composed of 1900 training and 100 validation scenes, each containing six outward-facing and 100 inward-facing cameras. The outward-facing cameras follow the NuScenes setup. The dataset is generated using the CARLA Simulator. 3D bounding boxes for all vehicles are provided, along with semantic and instance segmentation masks and depth maps. The dataset is available for download here.

The dataset provides RGB, depth maps, and semantic and instance segmentation masks for each image.
The 3D bounding boxes of each vehicle in the scene are also available.
Town     Number of Scenes   Split   Description
Town01   255                Train   A small, simple town with a river and several bridges.
Town03   265                Train   A larger, urban map with a roundabout and large junctions.
Town04   372                Train   A small town embedded in the mountains with a special "figure of 8" infinite highway.
Town05   302                Train   Squared-grid town with cross junctions and a bridge. It has multiple lanes per direction. Useful to perform lane changes.
Town06   436                Train   Long many lane highways with many highway entrances and exits. It also has a Michigan left.
Town07   116                Train   A rural environment with narrow roads, corn, barns and hardly any traffic lights.
Town10   155                Train   A downtown urban environment with skyscrapers, residential buildings and an ocean promenade.
Town02   101                Val     A small simple town with a mixture of residential and commercial buildings.

[Figure: side-by-side comparison of PixelNeRF and 6Img-to-3D]

[Figure: example input and supervision images for Town 1 through Town 7; two rows per town, each with three input images and three supervision images]

BibTeX

@misc{gieruc20246imgto3d,
      title={6Img-to-3D: Few-Image Large-Scale Outdoor Driving Scene Reconstruction}, 
      author={Théo Gieruc and Marius Kästingschäfer and Sebastian Bernhard and Mathieu Salzmann},
      year={2024},
      eprint={2404.12378},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgement

The research leading to these results is partially funded by the German Federal Ministry for Economic Affairs and Climate Action within the project “NXT GEN AI METHODS”. The authors wish to extend their sincere gratitude to the creators of TPVformer, NeRFStudio, and the KPlanes paper for generously open-sourcing their code. Thanks to Nerfies for the website template.