Abstract
For many fundamental scene understanding tasks, it is difficult or impossible
to obtain per-pixel ground truth labels from real images. We address this
challenge by introducing Hypersim, a photorealistic synthetic dataset for
holistic indoor scene understanding. To create our dataset, we leverage a large
repository of synthetic scenes created by professional artists, and we generate
77,400 images of 461 indoor scenes with detailed per-pixel labels and
corresponding ground truth geometry. Our dataset: (1) relies exclusively on
publicly available 3D assets; (2) includes complete scene geometry, material
information, and lighting information for every scene; (3) includes dense
per-pixel semantic instance segmentations and complete camera information for
every image; and (4) factors every image into diffuse reflectance, diffuse
illumination, and a non-diffuse residual term that captures view-dependent
lighting effects.
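A minimal sketch of the factorization in (4), assuming each per-image rendering pass is stored as an HDF5 array; the file names, the dataset key "dataset", and the load_pass helper below are illustrative placeholders, not the dataset's documented interface:

    import h5py
    import numpy as np

    def load_pass(path, key="dataset"):
        # Load one rendering pass as an H x W x 3 float array.
        with h5py.File(path, "r") as f:
            return np.array(f[key], dtype=np.float32)

    # Hypothetical per-frame pass files.
    reflectance  = load_pass("frame.0000.diffuse_reflectance.hdf5")
    illumination = load_pass("frame.0000.diffuse_illumination.hdf5")
    residual     = load_pass("frame.0000.residual.hdf5")

    # Recombine the three factors into the final HDR color:
    #   color = diffuse_reflectance * diffuse_illumination + residual,
    # where the residual captures view-dependent (non-diffuse) lighting effects.
    color = reflectance * illumination + residual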
We analyze our dataset at the level of scenes, objects, and pixels, and we
analyze costs in terms of money, computation time, and annotation effort.
Remarkably, we find that it is possible to generate our entire dataset from
scratch, for roughly half the cost of training a popular open-source natural
language processing model. We also evaluate sim-to-real transfer performance on
two real-world scene understanding tasks (semantic segmentation and 3D shape
prediction), where we find that pre-training on our dataset significantly
improves performance on both tasks and achieves state-of-the-art performance
on the most challenging Pix3D test set. All of our rendered image data, as well
as all the code we used to generate our dataset and perform our experiments, is
available online.