We propose 6Img-to-3D, a transformer-based encoder-renderer method. 6Img-to-3D generates 3D-consistent multi-view images of large-scale, unbounded outdoor driving scenes from only six input images. Existing approaches require detailed pose information and cannot faithfully reconstruct occluded regions. We take a step towards resolving these shortcomings by combining cross- and self-attention mechanisms for triplane parameterization with differentiable volume rendering. We show that six surround-view vehicle cameras, without absolute pose information, are enough to reconstruct 360° scenes at inference time.
Pixel-aligned image features are extracted individually from each of the six outward-looking images using a pre-trained ResNet-101 followed by a Feature Pyramid Network (FPN). The features are then projected into a triplane using deformable attention. The renderer is a shallow MLP. For each 3D point, it takes the triplane latent features concatenated with the corresponding 2D image features, which are obtained by projecting the 3D coordinates into 2D image space, and it outputs a color and an opacity. The colors and opacities are then composited into 2D images using volumetric rendering. The rendered images are used to compute the loss, which is backpropagated through the entire network, so the whole model is trained end-to-end.
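The compositing step of volumetric rendering can be illustrated with a minimal NumPy sketch. It assumes the standard NeRF-style quadrature, in which the renderer predicts a non-negative density per sample that is converted to an opacity; the function name `composite_ray` and its arguments are hypothetical, and the actual 6Img-to-3D renderer may differ in its details.

```python
import numpy as np

def composite_ray(colors, densities, deltas):
    """Composite per-sample colors along one ray (NeRF-style quadrature).

    colors:    (N, 3) RGB predicted at each sample, ordered near to far
    densities: (N,)   non-negative density at each sample (assumed output)
    deltas:    (N,)   distance covered by each sample segment
    """
    # Opacity of each segment: alpha_i = 1 - exp(-sigma_i * delta_i)
    alphas = 1.0 - np.exp(-densities * deltas)
    # Transmittance: fraction of light that reaches sample i unoccluded,
    # i.e. the product of (1 - alpha_j) over all earlier samples j < i
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = transmittance * alphas
    # Pixel color is the weighted sum of the sample colors
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights

# Two samples: a nearly opaque red one in front of a blue one
colors = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
densities = np.array([1e3, 1e3])
deltas = np.array([0.1, 0.1])
rgb, weights = composite_ray(colors, densities, deltas)
```

Because every operation here is differentiable, a loss on the composited color can propagate back to whatever network produced the per-sample colors and densities.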
To further improve the quality of the rendered images and speed up inference, the images are rendered at a lower resolution and then upsampled using SwinFIR.
Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
---|---|---|---|
PixelNeRF | 15.263 | 0.695 | 0.673 |
w/o Scene Contraction | 17.493 | 0.726 | 0.479 |
w/o LPIPS Loss | 18.953 | 0.736 | 0.538 |
w/o Image Feature Proj. | 18.440 | 0.726 | 0.488 |
w/ SwinFIR Upscaler | 19.188 | 0.746 | 0.444 |
6Img-to-3D (Ours) | 18.864 | 0.733 | 0.453 |
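For readers unfamiliar with the first metric in the table, PSNR is derived from the mean squared error between a rendered image and its ground-truth view. The sketch below is a generic definition, assuming images stored as float arrays in the range [0, 1]; the function name `psnr` and the `max_value` parameter are illustrative, and the exact evaluation protocol behind the table (value range, averaging over views) is not specified here.

```python
import numpy as np

def psnr(rendered, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    rendered = np.asarray(rendered, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((rendered - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

# A uniform error of 0.1 on a [0, 1] image gives an MSE of 0.01
target = np.zeros((4, 4, 3))
rendered = np.full((4, 4, 3), 0.1)
score = psnr(rendered, target)
```

Higher PSNR and SSIM indicate a closer match to the ground truth, while lower LPIPS indicates better perceptual similarity, as the arrows in the table header suggest.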
No suitable dataset exists for the task of reconstructing scenes from outward-looking cameras mounted on a vehicle. We fill this gap by introducing our own Inward-Outward synthetic dataset, composed of roughly 1900 training and 100 validation scenes, each containing six outward-facing and 100 inward-facing cameras. The outward-facing cameras follow the NuScenes setup. The dataset is generated using the CARLA Simulator. 3D bounding boxes are provided for all vehicles, along with semantic and instance segmentation masks and depth maps. The dataset is available for download.
Town | Number of Scenes | Split | Description |
---|---|---|---|
Town01 | 255 | Train | A small, simple town with a river and several bridges. |
Town03 | 265 | Train | A larger, urban map with a roundabout and large junctions. |
Town04 | 372 | Train | A small town embedded in the mountains with a special "figure of 8" infinite highway. |
Town05 | 302 | Train | Squared-grid town with cross junctions and a bridge. It has multiple lanes per direction. Useful to perform lane changes. |
Town06 | 436 | Train | Long, many-lane highways with many highway entrances and exits. It also has a Michigan left. |
Town07 | 116 | Train | A rural environment with narrow roads, corn, barns and hardly any traffic lights. |
Town10 | 155 | Train | A downtown urban environment with skyscrapers, residential buildings and an ocean promenade. |
Town02 | 101 | Val | A small, simple town with a mixture of residential and commercial buildings. |
[Figure: input and supervision views for Town 1 through Town 7]
@misc{gieruc20246imgto3d,
title={6Img-to-3D: Few-Image Large-Scale Outdoor Driving Scene Reconstruction},
author={Théo Gieruc and Marius Kästingschäfer and Sebastian Bernhard and Mathieu Salzmann},
year={2024},
eprint={2404.12378},
archivePrefix={arXiv},
primaryClass={cs.CV}
}