6Img-to-3D

Abstract

We propose 6Img-to-3D, a transformer-based encoder-renderer method. 6Img-to-3D generates 3D-consistent multi-view images for large-scale unbounded outdoor driving scenarios from only six input images. Existing approaches require detailed pose information and cannot reconstruct occluded regions faithfully. We take a step towards resolving these shortcomings by combining cross- and self-attention mechanisms for triplane parameterization with differentiable volume rendering. We show that six surround-view vehicle cameras without absolute pose information are enough to reconstruct 360° scenes at inference time.

Method

Pixel-aligned image features are extracted individually from the six outward-looking images using a pre-trained ResNet-101 followed by a Feature Pyramid Network (FPN). The features are then projected into a triplane using deformable attention. The renderer is a shallow MLP whose input is the triplane latent 3D feature concatenated with the corresponding 2D image feature, obtained by projecting the 3D coordinate into 2D image space. For each sampled 3D point, the renderer outputs a color and an opacity, which are composited into 2D images using volumetric rendering. The rendered images are used to compute the loss, which is backpropagated through the entire network, so the model is trained end-to-end.
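
The compositing step can be illustrated with a minimal sketch. It assumes the standard NeRF-style quadrature, in which a per-sample density is converted to an opacity over the sample spacing; the function name `composite_ray` and its arguments are hypothetical and are not taken from the 6Img-to-3D code.

```python
import math


def composite_ray(densities, colors, deltas):
    """Alpha-composite samples along one ray, ordered front to back.

    densities: per-sample volume density (assumed non-negative)
    colors:    per-sample (r, g, b) tuples in [0, 1]
    deltas:    distance between consecutive samples along the ray
    Returns the pixel color and the accumulated opacity of the ray.
    """
    transmittance = 1.0  # fraction of light that has not been absorbed yet
    pixel = [0.0, 0.0, 0.0]
    for sigma, rgb, delta in zip(densities, colors, deltas):
        # Opacity of this sample (assumed NeRF-style conversion from density).
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        for channel in range(3):
            pixel[channel] += weight * rgb[channel]
        # Light remaining for the samples behind this one.
        transmittance *= 1.0 - alpha
    return pixel, 1.0 - transmittance


# An empty sample followed by an opaque green one yields a green pixel.
pixel, opacity = composite_ray(
    densities=[0.0, 1e9],
    colors=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    deltas=[1.0, 1.0],
)
```

Every operation in the loop is differentiable with respect to the densities and colors, which is what allows an image loss to be backpropagated to the renderer and encoder.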

To further improve the quality of the rendered images and speed up inference, images are rendered at a lower resolution and then upsampled using SwinFIR.

Results

Method	PSNR ↑	SSIM ↑	LPIPS ↓
PixelNeRF	15.263	0.695	0.673
w/o Scene Contraction	17.493	0.726	0.479
w/o LPIPS Loss	18.953	0.736	0.538
w/o Image Feature Proj.	18.440	0.726	0.488
w/ SwinFIR Upscaler	19.188	0.746	0.444
6Img-to-3D (Ours)	18.864	0.733	0.453
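
For readers unfamiliar with the first metric, PSNR is derived from the mean squared error between a rendered image and its ground truth. The sketch below uses the standard definition on flat lists of pixel values scaled to [0, 1]; the function name `psnr` and the `max_value` parameter are illustrative, and the exact evaluation code behind the table is not shown on this page.

```python
import math


def psnr(rendered, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(rendered) != len(target) or not rendered:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - t) ** 2 for r, t in zip(rendered, target)) / len(rendered)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] scale corresponds to roughly 20 dB.
value = psnr([0.0, 0.0], [0.1, 0.1])
```

Because the scale is logarithmic, the gap between PixelNeRF and the full model in the table corresponds to a substantially lower mean squared error.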

Inward-Outward Dataset

No suitable dataset exists for reconstructing scenes from outward-looking cameras onboard a vehicle. We fill this gap by introducing our own synthetic Inward-Outward dataset, composed of 1900 training and 100 validation scenes, each containing six outward-facing and 100 inward-facing cameras. The outward-facing cameras follow the NuScenes setup. The dataset is generated using the CARLA Simulator. 3D bounding boxes for all vehicles are provided, as well as semantic and instance segmentation masks and depth maps. The dataset is available for download here.

For each camera view, the dataset provides an RGB image, a depth map, and semantic and instance segmentation.
Furthermore, the 3D bounding boxes of each vehicle in the scene are also available.

Town	Number of Scenes	Split	Description
Town01	255	Train	A small, simple town with a river and several bridges.
Town03	265	Train	A larger, urban map with a roundabout and large junctions.
Town04	372	Train	A small town embedded in the mountains with a special "figure of 8" infinite highway.
Town05	302	Train	Squared-grid town with cross junctions and a bridge. It has multiple lanes per direction. Useful to perform lane changes.
Town06	436	Train	Long, multi-lane highways with many highway entrances and exits. It also has a Michigan left.
Town07	116	Train	A rural environment with narrow roads, corn, barns and hardly any traffic lights.
Town10	155	Train	A downtown urban environment with skyscrapers, residential buildings and an ocean promenade.
Town02	101	Val	A small simple town with a mixture of residential and commercial buildings.
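
The per-town counts can be tallied per split with a few lines of Python. The numbers are copied from the table above; the variable names are illustrative only.

```python
# (town, number of scenes, split), copied from the table above.
scenes = [
    ("Town01", 255, "Train"),
    ("Town03", 265, "Train"),
    ("Town04", 372, "Train"),
    ("Town05", 302, "Train"),
    ("Town06", 436, "Train"),
    ("Town07", 116, "Train"),
    ("Town10", 155, "Train"),
    ("Town02", 101, "Val"),
]

# Sum the scene counts for each split.
totals = {}
for _town, count, split in scenes:
    totals[split] = totals.get(split, 0) + count

print(totals)  # {'Train': 1901, 'Val': 101}
```

The table rows sum to 1901 training and 101 validation scenes, one more in each split than the rounded figures of 1900 and 100 quoted in the dataset description.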

BibTeX

@article{gieruc20246imgto3d,
      title={6Img-to-3D: Few-Image Large-Scale Outdoor Driving Scene Reconstruction}, 
      author={Théo Gieruc and Marius Kästingschäfer and Sebastian Bernhard and Mathieu Salzmann},
      year={2024},
      eprint={2404.12378},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      journal      = {arXiv preprint},
      volume     = {arXiv:2404.12378},
      }

Acknowledgement

The research leading to these results is partially funded by the German Federal Ministry for Economic Affairs and Climate Action within the project “NXT GEN AI METHODS”. The authors wish to extend their sincere gratitude to the creators of TPVformer, NeRFStudio, and the KPlanes paper for generously open-sourcing their code. Thanks to Nerfies for the website template.

[Figure: example input and supervision views for Town 1 through Town 7.]

IV 2025 (Oral Presentation)

6Img-to-3D Driving Demo

6Img-to-3D Zero-Shot on NuScenes