Neural Point-Based Graphics
We present a new point-based approach for modeling complex scenes. The approach uses a raw point cloud as the geometric representation of a scene, and augments each point with a learnable neural descriptor that encodes local geometry and appearance. A deep rendering network is learned in parallel with the descriptors, so that new views of the scene can be obtained by passing the rasterizations of a point cloud from new viewpoints through this network. The input rasterizations use the learned descriptors as point pseudocolors. We show that the proposed approach can be used for modeling complex scenes and obtaining their photorealistic views, while avoiding explicit surface estimation and meshing. In particular, compelling results are obtained for scenes scanned using hand-held commodity RGB-D sensors as well as standard RGB cameras, even in the presence of objects that are challenging for standard mesh-based modeling.
Main idea

Given a set of RGB(D) images, we first reconstruct a rough point cloud of the scene using classic Structure from Motion (SfM) and Multi-View Stereo (MVS) algorithms.

With every point in the point cloud we associate a small learnable N-dimensional descriptor (analogous to the 3-dimensional color that every point already has). We then project the descriptors onto the virtual cameras estimated by SfM (in the same way a colored point cloud is projected onto a camera) and feed those projections to a ConvNet, which is trained to render the scene from the corresponding view. We learn the ConvNet jointly with the descriptors to minimize the discrepancy between the predicted rendering and the actual image captured by a real camera.
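The projection step can be illustrated with a minimal sketch. It assumes points already expressed in camera coordinates, a simple pinhole camera with a single focal length and the principal point at the image center, and a one-pixel splat with a z-buffer. The function name `rasterize_descriptors` and all of its parameters are hypothetical and chosen for illustration only; this is not the authors' rasterizer.

```python
import math


def rasterize_descriptors(points, descriptors, focal, height, width):
    """Project camera-space points to pixels, keeping the nearest point per pixel.

    points      -- list of (x, y, z) tuples in camera coordinates (assumed)
    descriptors -- list of N-dimensional descriptors, one per point
    focal       -- pinhole focal length in pixels (assumed intrinsics)
    Returns (image, depth): image[v][u] holds the descriptor of the closest
    point that lands on pixel (u, v); empty pixels keep an all-zero descriptor.
    """
    n_dims = len(descriptors[0])
    image = [[[0.0] * n_dims for _ in range(width)] for _ in range(height)]
    depth = [[math.inf] * width for _ in range(height)]
    cx, cy = width / 2.0, height / 2.0  # principal point at the image center

    for (x, y, z), desc in zip(points, descriptors):
        if z <= 0:
            continue  # point is behind the camera
        u = math.floor(focal * x / z + cx)
        v = math.floor(focal * y / z + cy)
        inside = 0 <= u < width and 0 <= v < height
        if inside and z < depth[v][u]:  # z-buffer test: nearest point wins
            depth[v][u] = z
            image[v][u] = list(desc)
    return image, depth
```

The resulting multi-channel image plays the role of the pseudocolor rasterization that is fed to the ConvNet; with 3-dimensional descriptors set to RGB values, the same routine would produce an ordinary colored point cloud rendering.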

At train time, we learn this ConvNet on multiple scenes to make it universal. At test time, for an unseen set of RGB(D) images, we repeat the training pipeline, except that we fix the ConvNet and optimize only the point descriptors. With both the descriptors and the network trained, we can render the scene from an arbitrary viewpoint.
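The difference between the two regimes can be shown on a deliberately tiny toy problem. In the sketch below the "network" is a single scalar weight, each "descriptor" is a single scalar, and the "rendering" of point i is simply weight * descriptor_i. These are illustrative assumptions, not the actual model; the function name `fit` and its parameters are hypothetical. The only point is which parameters receive gradient updates in each regime.

```python
def fit(descriptors, weight, targets, lr=0.1, steps=200, update_weight=True):
    """Gradient descent on the squared error between weight * d_i and target_i.

    update_weight=True  -- joint optimization of renderer and descriptors
                           (the train-time regime in this toy setting)
    update_weight=False -- renderer frozen, only descriptors optimized
                           (the regime used for a new scene)
    Returns the optimized descriptors and the (possibly updated) weight.
    """
    descriptors = list(descriptors)
    for _ in range(steps):
        residuals = [weight * d - t for d, t in zip(descriptors, targets)]
        # Gradients of sum(residual_i ** 2) w.r.t. the weight and each descriptor.
        grad_w = sum(2.0 * r * d for r, d in zip(residuals, descriptors))
        grads_d = [2.0 * r * weight for r in residuals]
        descriptors = [d - lr * g for d, g in zip(descriptors, grads_d)]
        if update_weight:
            weight -= lr * grad_w
    return descriptors, weight
```

Calling `fit(new_descriptors, trained_weight, new_targets, update_weight=False)` mirrors the test-time procedure: the renderer is left untouched and only the per-point descriptors of the new scene are adjusted to explain its images.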

Our method generalizes to novel views and enables photorealistic real-time rendering of complex scenes.

The authors acknowledge the usage of the Skoltech CDISE HPC cluster ZHORES for obtaining the results presented in this paper.