Although neural radiance fields (NeRF) have shown impressive advances for
novel view synthesis, most methods typically require multiple input images of
the same scene with accurate camera poses. In this work, we seek to
substantially reduce the inputs to a single unposed image. Existing approaches
condition on local image features to reconstruct a 3D object, but often render
blurry predictions at viewpoints that are far away from the source view. To
address this issue, we propose to leverage both the global and local features
to form an expressive 3D representation. The global features are learned from a
vision transformer, while the local features are extracted from a 2D
convolutional network. To synthesize a novel view, we train a multilayer
perceptron (MLP) network conditioned on the learned 3D representation to
perform volume rendering. This novel 3D representation allows the network to
reconstruct unseen regions without enforcing constraints like symmetry or
canonical coordinate systems. Our method can render novel views from only a
single input image and generalize across multiple object categories using a
single model. Quantitative and qualitative evaluations demonstrate that the
proposed method achieves state-of-the-art performance and renders richer
details than existing approaches.
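To make the pipeline described in the abstract concrete, the following is a minimal PyTorch sketch of the general idea, not the authors' implementation: a small transformer encoder over image patches supplies a global feature, a 2D CNN supplies local features sampled at each 3D point's projection into the source view, and an MLP conditioned on both predicts density and color for volume rendering. All module names, layer sizes, and the toy encoders (GlobalLocalNeRF, local_dim, global_dim, and so on) are illustrative assumptions.

# Minimal sketch (not the authors' code) of conditioning a NeRF-style MLP on
# global (transformer) and local (CNN) features from a single input image.
# All names and layer sizes here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalLocalNeRF(nn.Module):
    def __init__(self, local_dim=64, global_dim=256, hidden=256):
        super().__init__()
        # Local features: a small 2D convolutional encoder over the input image.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, local_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(local_dim, local_dim, 3, padding=1), nn.ReLU(),
        )
        # Global features: a tiny ViT-style encoder over 8x8 image patches
        # (positional embeddings omitted to keep the sketch short).
        self.patch_embed = nn.Conv2d(3, global_dim, kernel_size=8, stride=8)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=global_dim, nhead=4, batch_first=True
        )
        self.vit = nn.TransformerEncoder(enc_layer, num_layers=2)
        # NeRF-style MLP conditioned on 3D position plus both feature types,
        # predicting density (sigma) and RGB color.
        self.mlp = nn.Sequential(
            nn.Linear(3 + local_dim + global_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),
        )

    def forward(self, image, points, uv):
        # image: (B, 3, H, W) single unposed input view
        # points: (B, N, 3) 3D sample points along the target-view rays
        # uv: (B, N, 2) projections of those points into the source image, in [-1, 1]
        feat_map = self.cnn(image)                                   # (B, C, H, W)
        tokens = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, P, G)
        global_feat = self.vit(tokens).mean(dim=1)                   # (B, G)
        # Sample local features at each point's projected pixel location.
        local_feat = F.grid_sample(
            feat_map, uv.unsqueeze(2), align_corners=True
        ).squeeze(-1).permute(0, 2, 1)                               # (B, N, C)
        global_feat = global_feat.unsqueeze(1).expand(-1, points.shape[1], -1)
        out = self.mlp(torch.cat([points, local_feat, global_feat], dim=-1))
        sigma = F.relu(out[..., :1])       # non-negative density
        rgb = torch.sigmoid(out[..., 1:])  # colors in [0, 1]
        return sigma, rgb


# Toy usage: one 64x64 image, 128 sample points.
model = GlobalLocalNeRF()
sigma, rgb = model(
    torch.rand(1, 3, 64, 64),
    torch.rand(1, 128, 3),
    torch.rand(1, 128, 2) * 2 - 1,
)
print(sigma.shape, rgb.shape)  # torch.Size([1, 128, 1]) torch.Size([1, 128, 3])

In a full system, the predicted densities and colors would then be composited along each ray with standard volume rendering weights to produce the pixel colors of the novel view.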
Description
Vision Transformer for NeRF-Based View Synthesis from a Single Input Image
@misc{lin2022vision,
author = {Lin, Kai-En and Yen-Chen, Lin and Lai, Wei-Sheng and Lin, Tsung-Yi and Shih, Yi-Chang and Ramamoorthi, Ravi},
keywords = {reconstruction},
note = {arXiv:2207.05736. Project website: https://cseweb.ucsd.edu/~viscomp/projects/VisionNeRF/},
title = {Vision Transformer for NeRF-Based View Synthesis from a Single Input Image},
url = {http://arxiv.org/abs/2207.05736},
year = 2022
}