Abstract
We present an end-to-end 3D reconstruction method for a scene by directly
regressing a truncated signed distance function (TSDF) from a set of posed RGB
images. Traditional approaches to 3D reconstruction rely on an intermediate
representation of depth maps prior to estimating a full 3D model of a scene. We
hypothesize that a direct regression to 3D is more effective. A 2D CNN extracts
features from each image independently; these features are then back-projected
and accumulated into a voxel volume using the camera intrinsics and extrinsics.
After accumulation, a 3D CNN refines the accumulated features and predicts the
TSDF values. Additionally, semantic segmentation of the 3D model is obtained
without significant computation. This approach is evaluated on the ScanNet
dataset, where we significantly outperform state-of-the-art baselines (deep
multiview stereo followed by traditional TSDF fusion) both quantitatively and
qualitatively. We compare our 3D semantic segmentation to prior methods that
use a depth sensor since no previous work attempts the problem with only RGB
input.
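
To make the back-projection step concrete, the sketch below shows one way per-image 2D features could be lifted into a shared voxel volume and averaged across views before being passed to a 3D CNN. The function name, tensor shapes, and the simple mean-pooling accumulation are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (assumed shapes and names, not the authors' code) of
# back-projecting per-image 2D CNN features into a voxel volume.
import torch

def backproject_features(feats, K, cam_T_world, origin, voxel_size, grid_dim):
    """feats:        (C, H, W) 2D feature map for one posed image
    K:            (3, 3) camera intrinsics
    cam_T_world:  (4, 4) world-to-camera extrinsics
    origin:       (3,) world coordinates of voxel (0, 0, 0)
    voxel_size:   voxel edge length in meters
    grid_dim:     (nx, ny, nz) voxel grid dimensions
    Returns per-view feature volume (C, nx, ny, nz) and a hit-count mask."""
    C, H, W = feats.shape
    nx, ny, nz = grid_dim

    # World coordinates of every voxel center.
    ix, iy, iz = torch.meshgrid(
        torch.arange(nx), torch.arange(ny), torch.arange(nz), indexing="ij")
    world = torch.stack([ix, iy, iz], dim=-1).float() * voxel_size + origin
    pts = world.reshape(-1, 3)                      # (N, 3), N = nx*ny*nz

    # Transform to the camera frame and project with the pinhole model.
    cam = cam_T_world[:3, :3] @ pts.T + cam_T_world[:3, 3:4]   # (3, N)
    z = cam[2].clamp(min=1e-6)
    uv = K @ cam
    u = (uv[0] / z).round().long()
    v = (uv[1] / z).round().long()

    # Keep voxels that lie in front of the camera and project inside the image.
    valid = (cam[2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)

    # Scatter the 2D features into the voxels they project onto.
    volume = torch.zeros(C, pts.shape[0])
    volume[:, valid] = feats[:, v[valid], u[valid]]
    count = valid.float()

    return volume.reshape(C, nx, ny, nz), count.reshape(nx, ny, nz)


# Accumulation over all posed images (mean pooling over views), e.g.:
#   vol_sum, cnt_sum = 0, 0
#   for feats, K, T in zip(all_feats, all_K, all_cam_T_world):
#       vol, cnt = backproject_features(feats, K, T, origin, voxel_size, grid_dim)
#       vol_sum, cnt_sum = vol_sum + vol, cnt_sum + cnt
#   fused = vol_sum / cnt_sum.clamp(min=1)   # input to the 3D CNN
```

Averaging is only one plausible accumulation rule; the key point the abstract makes is that fusion happens in feature space, before any TSDF or depth estimate, so the subsequent 3D CNN can reason jointly over all views.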