Review of Two NeurIPS 2020 Papers on 3D Reconstruction from 2D Images.
I took a look at a couple of papers on 3D reconstruction from NeurIPS 2020. Both papers were orally presented and hence can be considered significant. The two papers are Multiview Neural Surface Reconstruction by Disentangling Geometry and Appearance [1] and Convolutional Generation of Textured 3D Meshes [2]. There were 8 other papers with “mesh” in their title. Of these, GRAF: Generative Radiance Fields for 3D-Aware Image Synthesis [3] looks interesting.
A number of terms need some explanation in order to understand these papers.
UV Sphere: In computer graphics, a texture is a 2D color pattern applied to a surface (e.g. the triangles of a mesh). The common technique for handling variations of reflectance is to store the reflectance as a function of position in a pixel-based image and “map” it onto a 3D surface. The function or image is called a texture map, and the process of controlling reflectance properties this way is called texture mapping. UV mapping is the process of projecting a 2D image/pattern onto a 3D model’s surface for texture mapping. The letters “U” and “V” denote the axes of the 2D texture because “X”, “Y”, and “Z” are already used for the axes of the 3D object in model space, while “W” (in addition to XYZ) is used for quaternion rotations. Each face/polygon vertex in the 3D model has a corresponding (u, v) point in the UV map. The map can hold reflectance, bumps or surface normals for a bumpy look, displacements for geometry changes, etc. Unwrapping is flattening out the UV shell into a flat 2D space. A UV sphere refers to the mesh geometry of a sphere that is transferred to the 3D model; the 3D model has the same number of vertices, edges, and faces as the UV sphere.
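As a concrete illustration of the UV idea, here is a minimal sketch (Python/NumPy, not taken from either paper) that generates the vertices of a UV sphere together with their (u, v) texture coordinates; the ring and segment counts are arbitrary.

```python
import numpy as np

def uv_sphere(n_rings=16, n_segments=32, radius=1.0):
    """Generate UV-sphere vertices and their (u, v) texture coordinates."""
    vertices, uvs = [], []
    for i in range(n_rings + 1):          # latitude: 0 .. pi
        theta = np.pi * i / n_rings
        for j in range(n_segments + 1):   # longitude: 0 .. 2*pi
            phi = 2.0 * np.pi * j / n_segments
            # 3D position on the sphere (model space, X/Y/Z)
            x = radius * np.sin(theta) * np.cos(phi)
            y = radius * np.cos(theta)
            z = radius * np.sin(theta) * np.sin(phi)
            vertices.append((x, y, z))
            # corresponding texture coordinates (U/V), each in [0, 1]
            uvs.append((j / n_segments, 1.0 - i / n_rings))
    return np.array(vertices), np.array(uvs)

verts, uvs = uv_sphere()
print(verts.shape, uvs.shape)  # (561, 3) and (561, 2)
```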
Differentiable Renderer: The process of generating a 2D image from a 3D model is called rendering. Popular rendering APIs, such as OpenGL and Direct3D, decompose the process of rendering into a set of sequential user-defined programs called shaders. There exist many different types of shaders; vertex, rasterization, and fragment shaders are the three most important for a rendering pipeline. A rendering function R takes shape parameters Φs, camera parameters Φc, material parameters Φm and lighting parameters Φl as input and outputs an RGB image Ic or a depth image Id. A differentiable renderer computes the gradients of the output image with respect to the input parameters, ∂I/∂Φ, in order to optimize for a specific goal (loss function). Paper [5] surveys differentiable rendering.
Rasterization: The process of finding all the pixels in an image that are occupied by a geometric primitive is called rasterization. Rasterization transforms vector representation to pixel representation. In the graphics pipeline, rasterization is a non-differentiable function. However, good approximations have been developed [4,6,7] that allow back propagation.
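As a rough illustration of how such approximations work (a generic soft-rasterization toy in PyTorch, not the exact scheme of [4], [6], or [7]): replace the hard inside/outside test of a primitive with a sigmoid of its signed distance, so pixel coverage, and therefore the rendered image, becomes differentiable with respect to the primitive's parameters.

```python
import torch

def soft_circle(center, radius, H=64, W=64, sharpness=50.0):
    """Differentiable 'rasterization' of a circle: a sigmoid of the signed
    distance to the boundary instead of a hard inside/outside test."""
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij")
    dist = torch.sqrt((xs - center[0]) ** 2 + (ys - center[1]) ** 2)
    # negative inside the circle, positive outside -> coverage close to 1 or 0
    coverage = torch.sigmoid(-sharpness * (dist - radius))
    return coverage  # (H, W) image with values in [0, 1]

center = torch.tensor([0.4, 0.6], requires_grad=True)
img = soft_circle(center, radius=0.25)
loss = ((img - 0.5) ** 2).mean()
loss.backward()
print(center.grad)  # gradients flow back through the "rasterizer"
```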
Fréchet Inception Distance (FID) [8]: The difference between two Gaussians (e.g. fitted to features of synthetic and real images) is measured by the Fréchet distance, also known as the Wasserstein-2 distance. FID calculates this distance between the feature distributions of real and synthetic images. A lower score means the two sets of images are more similar. Notationally it is expressed as follows.
FID = ‖μr − μg‖² + Tr(Σr + Σg − 2(ΣrΣg)^(1/2))
where Xr ∼ N(μr, Σr) and Xg ∼ N(μg, Σg) model the 2048-dimensional activations of the Inception-v3 pool3 layer for real and generated samples respectively. FID is commonly used as a measure of GAN quality.
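A minimal sketch of the computation (assuming the 2048-dimensional Inception activations have already been extracted into two NumPy arrays; this is not the reference implementation):

```python
import numpy as np
from scipy import linalg

def fid(act_real, act_gen):
    """Frechet Inception Distance between two sets of Inception-v3 activations,
    each of shape (num_samples, 2048)."""
    mu_r, mu_g = act_real.mean(axis=0), act_gen.mean(axis=0)
    sigma_r = np.cov(act_real, rowvar=False)
    sigma_g = np.cov(act_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerical error
    return np.sum((mu_r - mu_g) ** 2) + np.trace(sigma_r + sigma_g - 2 * covmean)
```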
Signed Distance Function (SDF) [10]: An SDF represents a shape’s surface as a continuous volumetric field. The magnitude (level) of the field at a point is the distance to the surface boundary, and its sign indicates whether the point lies inside (−) or outside (+) the shape. The representation therefore encodes the shape’s boundary as the zero-level-set of the function, while classifying every point in space as part of the shape’s interior or not.
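For example, the SDF of a sphere of radius r centered at c is f(x) = ‖x − c‖ − r. A minimal sketch:

```python
import numpy as np

def sphere_sdf(x, center=np.zeros(3), radius=1.0):
    """Signed distance to a sphere: negative inside, zero on the surface,
    positive outside."""
    return np.linalg.norm(x - center, axis=-1) - radius

print(sphere_sdf(np.array([0.0, 0.0, 0.0])))   # -1.0  (inside)
print(sphere_sdf(np.array([1.0, 0.0, 0.0])))   #  0.0  (on the surface)
print(sphere_sdf(np.array([2.0, 0.0, 0.0])))   #  1.0  (outside)
```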
The ultimate goal of the first paper I will go through, Convolutional Generation of Textured 3D Meshes [2], is to produce a textured 3D model from only a single 2D image. Unlike multi-view stereo (MVS) and structure-from-motion (SfM), which require multiple images from different angles for geometric estimation, here a mean/average model is learnt and then deformed in an appropriate way so that its rendering reproduces the 2D image. After going through this paper, I have to say that it is hard to follow. Without the NeurIPS presentation, it would have been impossible to decipher their work. I would have expected a NeurIPS-selected paper to be better organized and written.
The training consists of 3 steps. The first step is labeled Convolutional Mesh by the authors; its output is the 3D mesh. This step is a simplified version of the pipeline in [6], without texture flow, pose, keypoints, etc. The following picture shows the high-level parts of this pipeline. In essence, an auto-encoder is used to generate a 3D mesh, which is rendered to produce an image. The UV-mapping-related parts in the picture are for visual clarification only; there is no compute involved, so we do not have to worry about their differentiability. The difference between the rendered image and the ground-truth image is then optimized as usual. The texture map is not used any further; the displacement map is used in step 3.
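A rough sketch of the idea behind step 1 (my own simplification, not the authors' code; the layer sizes are placeholders and the renderer and loss are omitted): an encoder maps the image to a latent code, a decoder predicts a displacement map in UV space, the displacements are sampled at each template vertex's (u, v) location, and the deformed template mesh is then rendered and compared with the input image.

```python
import torch
import torch.nn as nn

class MeshAutoencoder(nn.Module):
    """Toy version of step 1: image -> latent -> UV displacement map -> deformed mesh.
    Sizes are placeholders, not the authors' implementation."""
    def __init__(self, latent_dim=128, uv_res=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim))
        # decoder outputs a 3-channel displacement map on the UV grid
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * uv_res * uv_res),
            nn.Unflatten(1, (64, uv_res, uv_res)),
            nn.Conv2d(64, 3, 3, padding=1))

    def forward(self, image, template_vertices, vertex_uvs):
        z = self.encoder(image)
        disp_map = self.decoder(z)                      # (B, 3, uv_res, uv_res)
        # sample the displacement map at each template vertex's (u, v) location
        grid = vertex_uvs[None, :, None, :] * 2 - 1     # (1, V, 1, 2) in [-1, 1]
        disp = nn.functional.grid_sample(
            disp_map, grid.expand(image.shape[0], -1, -1, -1),
            align_corners=True).squeeze(-1).permute(0, 2, 1)  # (B, V, 3)
        # deformed mesh = template sphere vertices plus predicted displacements
        return template_vertices[None] + disp
```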
The second step is labeled Inverse Rendering by the authors. In this step, ground-truth images are projected into UV space by a differentiable renderer, the Differentiable Interpolation-based Renderer (DIB-R) from [4]. The reason for this step is that training a GAN directly on the 3D texture, where the generator G produces a 3D mesh and the discriminator D judges its differentiably rendered 2D projection, leads to instabilities due to the differences in representation used by G and D: pose-independent and pose-dependent respectively.
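Conceptually, the projection into UV space can be pictured as follows (a toy NumPy sketch, assuming a renderer has already produced, for every foreground pixel, the (u, v) coordinate of the surface point it sees; DIB-R does this differentiably, which the sketch below does not attempt):

```python
import numpy as np

def image_to_uv_texture(image, pixel_uvs, mask, tex_res=64):
    """Scatter visible image colors into a UV texture.
    image: (H, W, 3), pixel_uvs: (H, W, 2) with values in [0, 1], mask: (H, W) bool."""
    texture = np.zeros((tex_res, tex_res, 3))
    counts = np.zeros((tex_res, tex_res, 1))
    u_idx = np.clip((pixel_uvs[..., 0] * (tex_res - 1)).astype(int), 0, tex_res - 1)
    v_idx = np.clip((pixel_uvs[..., 1] * (tex_res - 1)).astype(int), 0, tex_res - 1)
    for y, x in zip(*np.nonzero(mask)):
        texture[v_idx[y, x], u_idx[y, x]] += image[y, x]
        counts[v_idx[y, x], u_idx[y, x]] += 1
    return texture / np.maximum(counts, 1)  # average colors; unseen texels stay 0
```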
The third step is training the GAN (for a good intro to GANs see [14]); see the picture below. The displacement maps from step 1 and the pseudo-ground-truth textures recovered in step 2 are used directly to train the discriminator D. The generator G produces a full texture, but it is masked with a random mask from the training set before the discriminator D sees it. This apparently avoids a distribution mismatch between fake and real textures.
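A sketch of how that masking might look inside the training loop (hypothetical tensor names and shapes; the real masks come from the partial visibility of the step-2 textures):

```python
import torch

def discriminator_inputs(real_texture, real_mask, fake_texture, mask_pool):
    """Apply the same kind of visibility masking to fake textures that the
    real (inverse-rendered) textures naturally have, so D cannot tell them
    apart just from which texels are filled in.
    real_texture: (B, 3, H, W), real_mask: (B, 1, H, W),
    fake_texture: (B, 3, H, W), mask_pool: (N, 1, H, W)."""
    # real textures are already partial: only texels seen by some camera are filled
    real_input = real_texture * real_mask
    # mask the generator's complete texture with a random mask from the training set
    idx = torch.randint(0, mask_pool.shape[0], (fake_texture.shape[0],))
    fake_input = fake_texture * mask_pool[idx]
    return real_input, fake_input
```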
For evaluation the authors used FID. They obtained FID scores of 18.45 and 27.73 on the CUB-200-2011 and Pascal3D+ datasets respectively. Pascal3D+ is a dataset of images from 12 classes of the Pascal dataset [9], augmented with semi-manually aligned 3D CAD models.
The goal of the second paper, Multiview Neural Surface Reconstruction by Disentangling Geometry and Appearance [1], is to devise an end-to-end neural architecture that can learn 3D geometry from masked 2D images and rough camera estimates with no additional supervision. The 3 unknowns the paper attempts to recover are 3D shape/geometry, appearance/texture, and camera pose. The authors call the architecture the Implicit Differentiable Renderer (IDR).
The paper represents the geometry as the zero level set of a neural network (MLP) f: Sθ = {x ∈ R³ | f(x; θ) = 0}, where x ranges over 3D points and f is an SDF with parameters θ. When f is 0 at a point x, that point is on the surface of the shape. The following picture shows the high-level architecture along with the geometrical aspects.
In the picture
Rp(τ) = {c + tv | t ≥ 0} denotes the ray through pixel p, where c = c(τ) denotes the unknown center of the camera, v denotes the direction of the ray through p, and n represents the surface normal of the shape at the intersection point x.
L(θ, γ, τ) = M(x, n, z, v; γ) is the rendered color of the pixel; it is a function of the surface point x and its normal n, the viewing direction v, and a global geometry feature vector z = z(x; θ). M is implemented as a multilayer perceptron (MLP). θ, γ, and τ respectively parameterize the geometry, the appearance/texture, and the camera. Whereas f is parameterized by θ, M is parameterized by γ; the camera parameters τ enter through the ray, since x, n, and v all depend on it.
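Putting these pieces together, the forward pass per pixel is roughly: march along the ray until f reaches zero, take the surface normal as the gradient of f at that point, and feed everything to M. Below is a heavily simplified PyTorch sketch (fixed-step sphere tracing; `z_of` is a stand-in for the geometry network's feature output, and the paper's exact differentiable treatment of the intersection point is omitted):

```python
import torch

def render_pixel(f, M, c, v, z_of, n_steps=50):
    """Toy IDR-style forward pass for one ray: find the zero level set of the
    SDF f by sphere tracing, then shade with the appearance network M."""
    t = torch.zeros(1)
    for _ in range(n_steps):            # sphere tracing: step forward by the SDF value
        x = c + t * v
        t = t + f(x)
    x = (c + t * v).detach().requires_grad_(True)
    # surface normal = gradient of the SDF at the intersection point
    n = torch.autograd.grad(f(x).sum(), x, create_graph=True)[0]
    z = z_of(x)                          # global geometry feature vector
    return M(x, n, z, v)                 # rendered RGB for this pixel
```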
The model simultaneously trains θ, γ, and τ against the loss function shown in the picture and below.
loss(θ, γ, τ) = loss_RGB(θ, γ, τ) + ρ·loss_MASK(θ, τ) + λ·loss_E(θ)
The loss has 3 components. The first one penalizes RGB differences against the ground-truth image Ip; it is an L1 distance. The second term comes from using a binary mask representing whether a pixel is part of the shape or not; it is a cross-entropy loss between the ground-truth mask value and an occupancy indicator derived from the geometry f. The third term is an Eikonal regularization that forces f to be approximately a signed distance function [13].
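A sketch of the three terms (tensor names and weights are mine, not the authors'; the mask term in the paper is derived from the minimum of f along each ray, which is abstracted here into precomputed logits):

```python
import torch
import torch.nn.functional as F

def idr_loss(pred_rgb, gt_rgb, pred_mask_logits, gt_mask, f, sample_points,
             rho=100.0, lam=0.1):
    """Toy version of the IDR loss. gt_mask is a float tensor of 0/1 values;
    rho and lam are illustrative weights."""
    # RGB term: L1 distance on the sampled pixels
    loss_rgb = (pred_rgb - gt_rgb).abs().mean()
    # Mask term: cross entropy between predicted occupancy and the binary mask
    loss_mask = F.binary_cross_entropy_with_logits(pred_mask_logits, gt_mask)
    # Eikonal term: push |grad f| toward 1 so f stays close to a signed distance function
    sample_points.requires_grad_(True)
    grad = torch.autograd.grad(f(sample_points).sum(), sample_points,
                               create_graph=True)[0]
    loss_eikonal = ((grad.norm(dim=-1) - 1) ** 2).mean()
    return loss_rgb + rho * loss_mask + lam * loss_eikonal
```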
Both f and M are MLPs, of 8 and 4 layers respectively. The dataset used is DTU MVS [12]: 1200×1600 RGB images of 80 objects, each captured from 49 or 64 camera positions. As the authors do not use any convolutional filters, they sample 2048 pixels from each picture and optimize iteratively.
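The per-iteration pixel sampling is essentially just picking a random subset of pixel coordinates (a sketch, not the authors' exact sampler):

```python
import torch

def sample_pixels(image, mask, n_samples=2048):
    """Randomly pick pixel coordinates for one optimization step.
    image: (H, W, 3), mask: (H, W)."""
    H, W = mask.shape
    idx = torch.randint(0, H * W, (n_samples,))
    ys, xs = idx // W, idx % W
    return image[ys, xs], mask[ys, xs], ys, xs  # sampled colors, mask bits, coords
```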
This write-up provides a high-level view of the two papers. Implementation details and the mathematics are left out, but they are not very hard to go through, as the authors have provided source code and supplementary materials.
References:
1. Multiview Neural Surface Reconstruction by Disentangling Geometry and Appearance, oral: https://neurips.cc/virtual/2020/protected/poster_1a77befc3b608d6ed363567685f70e1e.html, arxiv: https://arxiv.org/abs/2003.09852, github: https://github.com/lioryariv/idr
2. Convolutional Generation of Textured 3D Meshes, oral: https://neurips.cc/virtual/2020/protected/poster_098d86c982354a96556bd861823ebfbd.html, arxiv: https://arxiv.org/abs/2006.07660, github: https://github.com/dariopavllo/convmesh
3. GRAF: Generative Radiance Fields for 3D-Aware Image Synthesis, poster: https://neurips.cc/virtual/2020/protected/poster_e92e1b476bb5262d793fd40931e0ed53.html
4. Learning to Predict 3D Objects with an Interpolation-based Differentiable Renderer, https://arxiv.org/pdf/1908.01210.pdf
5. Differentiable Rendering: A Survey, https://arxiv.org/pdf/2006.12057.pdf
6. Learning Category-Specific Mesh Reconstruction from Image Collections, arxiv: https://arxiv.org/pdf/1803.07549.pdf, github: https://github.com/akanazawa/cmr
7. Neural 3D Mesh Renderer, https://arxiv.org/pdf/1711.07566.pdf
8. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, https://arxiv.org/abs/1706.08500
9. Beyond PASCAL: A Benchmark for 3D Object Detection in the Wild, https://cvgl.stanford.edu/papers/xiang_wacv14.pdf
10. DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation, https://arxiv.org/pdf/1901.05103.pdf
11. Differentiable Volumetric Rendering: Learning Implicit 3D Representations Without 3D Supervision, https://arxiv.org/abs/1912.07372
12. Large Scale Multi-view Stereopsis Evaluation, https://roboimagedata2.compute.dtu.dk/data/text/multiViewCVPR2014.pdf
13. Implicit Geometric Regularization for Learning Shapes, https://arxiv.org/pdf/2002.10099.pdf
14. GAN Lab, https://poloclub.github.io/ganlab/