Deep Learning Multiview Stereo (MVS)

Subrata Goswami
Jan 18, 2022


The goal of Multiview Stereo (MVS) is to generate a 3D point cloud or model from pictures taken from different locations. It is a problem with a long history, and it pre-dates Deep Learning by decades. It is also an area where geometric and photometric consistency and priors are extensively used. Given the long history, a lot of terms have accumulated over time; I have tried to describe those in sufficient detail in the Glossary section.

References [8, 24] provide good overviews, and [23] a somewhat comprehensive list. In the following, I share my understanding of 2 Deep Learning based MVS architectures. Before delving into the details of those architectures, I would like to mention 2 other publications. PatchmatchNet [13] is a Deep Learning adaptation of the PatchMatch method (described in the Glossary section). COLMAP [6] is one of the early architectures; it is not deep-learning based, but is commonly used for comparison.

GC-Net:

The first architecture is GC-Net [20], one of the earliest end-to-end trainable networks for stereo-vision MVS. The paper is well written, and the architecture is fairly standard and easy to follow. The following 2 diagrams show the network used by the authors.

Figure 1: Skydio GC-Net block diagram.
Figure 2: Skydio GC-Net layers.

The features are generated by a 2D convolutional network, as is standard. For each stereo image, the network forms a cost volume (see the Glossary section for how cost volumes are constructed) of shape height × width × (max disparity + 1) × feature size. This is achieved by concatenating each unary feature with the corresponding unary feature from the opposite stereo image at each disparity level, and packing these into a 4D volume. Hence, 2 branches are visible in the block diagram above.

Regularization is commonly used to smooth out noise and produce better results, and here it is applied to the cost volume. In this architecture, regularization is implemented with 3D convolutions, hence learnt rather than hand-crafted. The regularization network consists of 4 levels of sub-sampling through blocks of 3D convolution + batch-norm + ReLU (e.g. layers 21, 24, 27, and 30 in the figure above). The 1/32-resolution feature is then up-sampled back to the original shape in 5 steps. Each up-sampling block is composed of one transposed convolution whose output is summed with the convolutional feature of the same level, hence the same shape; for example, layer 36 is the penultimate transposed convolution layer and layer 20 is the output of the first down-sampling block. These residual connections help preserve resolution.

The authors use a soft argmin for differentiability and sub-pixel accuracy. The soft argmin, unlike argmin, is a weighted average, and hence can be inaccurate for multi-modal distributions; the authors postulated that the regularization network would prevent multi-modal distributions. The loss is an L1 loss on disparity (i.e. the average absolute difference between the network's predicted disparity and the label).
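As a rough illustration (my own sketch, not the authors' code), the soft argmin over a cost volume takes only a few lines of PyTorch; cost is assumed to already be the regularized cost of shape (batch, max disparity + 1, height, width):

import torch
import torch.nn.functional as F

def soft_argmin(cost):
    # cost: (B, D, H, W) matching costs; lower cost means better match.
    # Softmax over the negated costs yields, per pixel, a probability
    # distribution over disparities.
    prob = F.softmax(-cost, dim=1)                        # (B, D, H, W)
    disparities = torch.arange(cost.shape[1], dtype=cost.dtype,
                               device=cost.device).view(1, -1, 1, 1)
    # Expected disparity: a differentiable, sub-pixel weighted average.
    return (prob * disparities).sum(dim=1)                # (B, H, W)

Because the output is an expectation, a bi-modal distribution can land between its modes, which is exactly why the authors count on the regularization network to keep the distribution uni-modal.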

The authors used synthetic data to pre-train their network, and then fine-tuned and evaluated on the KITTI 2012 and 2015 stereo datasets.

PatchMatch-RL:

PatchMatch-RL MVS [1] is a very recent paper that was presented orally at ICCV 2021. The paper is harder to follow, probably in part because the architecture is more complex; the authors could have provided more details tying the various parts of the architecture together. The following figure shows their architecture.

Figure 3: PatchMatch-RL MVS [1].

In multi-image MVS, usually one of the images of the scene is taken as reference and the rest as source images. When there are many images of a scene, only a subset of good (by some criterion) images is selected as source images for 3D point cloud determination. The view-selection probability is per pixel and is realized by an MLP that uses geometric priors and feature correlation. The geometric priors used are the Triangulation Prior, the Resolution Prior, and the Incident Prior; the following picture shows all 3, and a small sketch of the first one follows the figure. The Triangulation Prior forces selection of source images that have sufficient baseline, and hence a different viewing angle with respect to the reference image. The Resolution Prior forces selection of images in which patches have similar size and shape, which implies a preference for images not taken at extreme distances or angles. The Incident Prior prefers views that observe the surface from the front, i.e. non-oblique viewing directions.

Figure 4: Geometric priors used [6]
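To make the Triangulation Prior concrete, here is a small sketch of my own (not from [1]): the triangulation angle at a candidate 3D point is the angle between the viewing rays from the two camera centers, and a prior can score source views by how close that angle is to some target baseline angle (the target and sigma below are illustrative):

import numpy as np

def triangulation_prior(x, c_ref, c_src, target_deg=10.0, sigma_deg=5.0):
    # x: candidate 3D point; c_ref, c_src: camera centers, all shape (3,).
    r1, r2 = c_ref - x, c_src - x
    cos_a = r1 @ r2 / (np.linalg.norm(r1) * np.linalg.norm(r2))
    angle = np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))
    # Gaussian score: highest when the baseline angle is near the target.
    return np.exp(-(angle - target_deg) ** 2 / (2 * sigma_deg ** 2))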

The authors define the correlation value of an oriented point as the attention-aggregated group-wise correlation of matching feature vectors in the source image, as shown by the equations in the following figure. In the figure, h is an attention vector, implemented as a 1×1 convolution; σ is the Normal distribution; and W is a squarish window of size α and dilation β. Group-wise correlation is obtained by taking the inner/dot product along the channel dimension after the channels have been evenly divided into a number of groups [25]; a minimal sketch follows the figure.

Figure 5: Correlation between reference and source images.
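As a minimal sketch of the group-wise correlation part (following the description in [25], not the authors' exact code), assuming reference and source feature maps already aligned at a candidate plane:

import torch

def groupwise_correlation(f_ref, f_src, groups):
    # f_ref, f_src: (C, H, W) feature maps with C divisible by `groups`.
    c, h, w = f_ref.shape
    f_ref = f_ref.view(groups, c // groups, h, w)
    f_src = f_src.view(groups, c // groups, h, w)
    # Inner product over each group's channels, normalized by group size:
    # one correlation value per group per pixel -> (groups, H, W).
    return (f_ref * f_src).mean(dim=1)

The attention vector h then aggregates these per-group correlations into a single value per pixel.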

As the authors use an FPN to obtain features at multiple levels, they use the oriented points and hidden state of the immediately coarser level for initialization. At the finest level, the 3D point cloud is obtained by fusion [9] (e.g. averaging).

The randomness used in PatchMatch precludes a regular ordinal neighborhood structure (e.g. a volume or graph). The PatchMatch part of the architecture is implemented using Loopy Belief Propagation (see the Glossary for a description) through an RNN. The inputs to the RNN are the visibility-weighted feature correlations appended with a pairwise term; the pairwise term also acts as a smoothness enabler [26]. The outputs of the RNN are the estimated regularized costs, and its hidden state corresponds to the belief for each candidate. The best candidate for the next iteration is hard-sampled by regularized cost.

The argmax-based hard selection in both view selection and PatchMatch makes the architecture non-differentiable, and hence impossible to train end-to-end with plain back-propagation. The authors use Reinforcement Learning (RL), specifically REINFORCE (Monte-Carlo Policy Gradient) [28], to train end-to-end; see the Glossary for a brief description of the REINFORCE algorithm. They use 2 agents: one scores views and the other scores PatchMatch candidates.
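To make the trick concrete, here is a generic sketch (mine, not the authors' code) of how REINFORCE lets gradients flow around a hard, sampled selection: the score network's output defines a categorical distribution, a choice is hard-sampled, and the negative log-probability weighted by a downstream reward gives a loss whose gradient is an unbiased policy-gradient estimate:

import torch
from torch.distributions import Categorical

def reinforce_loss(scores, reward):
    # scores: (N,) unnormalized scores over N discrete candidates
    # (e.g. source views or PatchMatch candidates); reward: a scalar
    # quality measure computed downstream of the hard choice.
    dist = Categorical(logits=scores)
    choice = dist.sample()            # hard, non-differentiable selection
    loss = -dist.log_prob(choice) * reward
    return choice, loss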

The authors evaluated against 2 datasets, Tanks & Temples and ETH3D. They claim their architecture produces the best results at an intermediate resolution of 5 cm, but falls short of hand-crafted traditional methods at lower resolutions.

The code is based on PyTorch Lightning, and for the most part follows the standard Deep Learning methodology: define a loss and optimize parameters through back-propagation. There is very little documentation and the variable names are cryptic. Training happens in a method called run_and_log, which is eventually called by PyTorch Lightning's fit method for back-propagation. The following shows a high-level pseudo call stack for training.

run_and_log(self, batch, batch_idx, train, log_prefix)
    inf_p, steps = self.forward(images, ...)
        cameras = MVSCamera(K, E, P, images.shape[-2:], ranges)
        feature_layers = self.feature_extractor(images)
        step(cameras, features, ...)
            view_iw = select_views(cameras, features, plane_map, ...)
                view_ps, view_i = sample(valid_view_s, ...)
            planes = select_plane(cameras, features, view_iw, ...)
                plane_ps, plane_i = sample(cost_volume, ...)
    gt_n = compute_normal_from_depth(cameras, gt_d, True)
    gt_p = normal_depth_to_plane(cameras, gt_n, gt_d)
    d_nll = (-d_probs * step_p_probs.log()).sum(0, keepdim=True)
    d_loss += resize_bhwc(d_nll, out_shape)          # accumulate depth loss
    n_nll = -(d_probs * n_probs * step_p_probs.log()).sum(0, ...)
    n_loss += resize_bhwc(n_nll, out_shape)          # accumulate normal loss
    rs = plane_similarity(cameras, step_p, step_gp, d_sigma, n_sigma)
    g_is.append(resize_bhwc(rs, out_shape))
    g_t = torch.stack(g_is).sum(0)
    v_loss += g_t * resize_bhwc(v_nll, out_shape)    # all future reward
    loss = d_loss + n_loss + v_loss

Conclusion:

MVS is considered one of the more economical approaches for obtaining rich 3D point clouds compared with methods like LIDAR that require specialized and somewhat more expensive equipment. However, it is computationally intensive and suffers from ambiguities. The above are just a couple of examples of the work that has been going on for decades.

Glossary:

View Selection refers to picking only a subset of images for stereopsis. Although the word stereo suggests 2 views, in reality many more than 2 images are used.

Geometric priors encourage the selection of views with sufficient baseline (Triangulation Prior), similar resolution (Resolution Prior), and non-oblique viewing direction (Incident Prior) [5].

Pixelwise selection refers to choosing images independently per pixel. It improves over a fixed pre-selection of images for all pixels by reducing sensitivity to noise, yielding better completeness, and handling large viewpoint changes between images better. It also handles occlusion and illumination variations better [12].

Plane-Sweep uses a stack of planes at various depths parallel to a reference image. Each target image is projected onto each depth plane and into the reference view via a homography, producing a warped image. The reference image and each warped target image are then compared by some metric (e.g. ZNCC [16]), and the best-matching depth plane is chosen. See the following figure for a pictorial view, and the slide deck [2] for a good overview; a sketch of the warping step follows the figure.

Figure 6: Plane-Sweep, from [2].
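As a minimal sketch of the warping step (my own illustration with NumPy and OpenCV, not taken from [2]): for a fronto-parallel plane at depth d, with rotation R and translation t mapping reference-camera coordinates to target-camera coordinates, intrinsics K_ref and K_tgt, and plane normal n = (0, 0, 1) in the reference frame, the induced homography is H(d) = K_tgt (R − t nᵀ / d) K_ref⁻¹:

import cv2
import numpy as np

def plane_sweep_warp(img_tgt, K_ref, K_tgt, R, t, depth, out_hw):
    # Homography induced by the reference-frame plane z = depth.
    n = np.array([[0.0, 0.0, 1.0]])
    H = K_tgt @ (R - (t.reshape(3, 1) @ n) / depth) @ np.linalg.inv(K_ref)
    # H maps reference pixels to target pixels, so it is used as the
    # dst -> src map when resampling the target into the reference view.
    return cv2.warpPerspective(img_tgt, H, (out_hw[1], out_hw[0]),
                               flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)

Sweeping depth over a range and scoring each warped image against the reference (e.g. with ZNCC [16]) gives the per-pixel depth estimate.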

PatchMatch is conceptually a matching algorithm between a patch in one image and the approximate nearest one in another image. The brute-force approach for such matching is O(mM²), where m and M are the number of pixels in the patch and image respectively; the M² comes from having to look at every scaling of the target image, as the images need not be of the same scale. The algorithm is fairly simple, with just 3 main steps, as shown in the following figure.

Figure 7: PatchMatch algorithm (from [3])

The intuition behind the algorithm is that patches that are close together in the source image tend to have matches that are close together in the target image. Hence, it is worthwhile to check whether a nearby neighbor's match is also a better match for the current patch, and if so, take it. Step c prevents the algorithm from getting stuck in a local minimum by exploring random patches within concentric, exponentially shrinking search windows. Steps b and c are called propagate and perturb respectively. See [3] and [4] for great overviews of the algorithm.
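The following is a compact, unoptimized sketch of those three steps for grayscale images (mine, not from [3]); it computes a nearest-neighbor field (NNF) mapping each source pixel to a target location:

import numpy as np

def patchmatch(src, tgt, patch=7, iters=4, seed=0):
    rng = np.random.default_rng(seed)
    h, w = src.shape
    th, tw = tgt.shape
    # (a) random initialization of the NNF
    nnf = np.stack([rng.integers(0, th, (h, w)),
                    rng.integers(0, tw, (h, w))], axis=-1)

    def cost(y, x, ty, tx):
        # Sum of squared differences over (border-truncated) patches.
        r = patch // 2
        a = src[max(y - r, 0):y + r + 1, max(x - r, 0):x + r + 1]
        b = tgt[max(ty - r, 0):ty + r + 1, max(tx - r, 0):tx + r + 1]
        hh, ww = min(a.shape[0], b.shape[0]), min(a.shape[1], b.shape[1])
        return np.sum((a[:hh, :ww] - b[:hh, :ww]) ** 2)

    for it in range(iters):
        d = 1 if it % 2 == 0 else -1  # alternate scan direction each pass
        ys = range(h) if d == 1 else range(h - 1, -1, -1)
        xs = range(w) if d == 1 else range(w - 1, -1, -1)
        for y in ys:
            for x in xs:
                best = cost(y, x, *nnf[y, x])
                # (b) propagation: try already-visited neighbors' matches
                for dy, dx in ((d, 0), (0, d)):
                    py, px = y - dy, x - dx
                    if 0 <= py < h and 0 <= px < w:
                        ty = int(np.clip(nnf[py, px, 0] + dy, 0, th - 1))
                        tx = int(np.clip(nnf[py, px, 1] + dx, 0, tw - 1))
                        c = cost(y, x, ty, tx)
                        if c < best:
                            best, nnf[y, x] = c, (ty, tx)
                # (c) random search in exponentially shrinking windows
                rad = max(th, tw)
                while rad >= 1:
                    ty = int(np.clip(nnf[y, x, 0] + rng.integers(-rad, rad + 1), 0, th - 1))
                    tx = int(np.clip(nnf[y, x, 1] + rng.integers(-rad, rad + 1), 0, tw - 1))
                    c = cost(y, x, ty, tx)
                    if c < best:
                        best, nnf[y, x] = c, (ty, tx)
                    rad //= 2
    return nnf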

Cost volume is a way to incorporate geometric constraints and priors [20, 21, 22] into a DL MVS pipeline. The basic idea is to take one of the images of the scene as reference and project the rest of the images onto the reference at a range of candidate depths. The (homography) projection is possible because the camera intrinsic and extrinsic matrices for each image are made available in some way. The resulting H×W×D volume can be filled with values that in some way represent the cost, or discrepancy, between a pixel in the reference image and the other images.
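For the rectified two-view case used by GC-Net, no homography is needed: the matching candidates for a left pixel are simple horizontal shifts in the right image. A minimal sketch of that concatenation cost volume (my own, following the paper's description):

import torch

def concat_cost_volume(f_left, f_right, max_disp):
    # f_left, f_right: (B, C, H, W) unary features of a rectified pair.
    b, c, h, w = f_left.shape
    volume = f_left.new_zeros(b, 2 * c, max_disp + 1, h, w)
    for d in range(max_disp + 1):
        # Left pixel x corresponds to right pixel x - d at disparity d.
        if d == 0:
            volume[:, :c, d] = f_left
            volume[:, c:, d] = f_right
        else:
            volume[:, :c, d, :, d:] = f_left[:, :, :, d:]
            volume[:, c:, d, :, d:] = f_right[:, :, :, :-d]
    return volume  # (B, 2C, max_disp + 1, H, W), fed to 3D convolutions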

Regularization in the MVS context has a similar meaning as in the general ML context: prevent overfitting, address ill-posed problems, etc. Regularization can be considered a form of a priori constraint [22]. In its simplest form it is the minimization of ‖Az − y‖² + λ‖Pz‖², where A and the stabilizing operator P are both linear, and λ determines how "regular" the solution should be.
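A tiny worked example of that minimization (my own): for linear A and P the minimizer of ‖Az − y‖² + λ‖Pz‖² has the closed form z = (AᵀA + λPᵀP)⁻¹Aᵀy, shown here smoothing a noisy 1D signal with a first-difference stabilizer:

import numpy as np

def tikhonov(A, y, P, lam):
    # Closed-form minimizer of ||A z - y||^2 + lam * ||P z||^2.
    return np.linalg.solve(A.T @ A + lam * P.T @ P, A.T @ y)

n = 100
y = np.sin(np.linspace(0, 3, n)) + 0.1 * np.random.randn(n)
A = np.eye(n)                      # observe the signal directly
P = np.diff(np.eye(n), axis=0)     # penalize jumps between neighbors
z = tikhonov(A, y, P, lam=10.0)    # larger lam -> smoother ("more regular") z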

Depth map refers to the depth of each pixel of the reference image. A depth map can be directly projected into space using the extrinsic and intrinsic camera parameters to generate a 3D point cloud.
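A sketch of that back-projection (my own; conventions assumed: intrinsics K and extrinsics [R|t] mapping world to camera coordinates):

import numpy as np

def depth_to_point_cloud(depth, K, R, t):
    # depth: (H, W); K: 3x3 intrinsics; [R|t]: world -> camera extrinsics.
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    # Back-project each pixel to camera coordinates: X_cam = d * K^-1 [u, v, 1]
    x_cam = (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)
    # Camera -> world: X_world = R^T (X_cam - t)
    return (R.T @ (x_cam - t.reshape(1, 3)).T).T    # (H*W, 3) point cloud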

Pixel normal refers to the 3D plane in which the 3D point corresponding to a pixel in the reference image lies. Accounting for it makes the photometric values of corresponding regions/patches in two or more images of the same scene more similar, by accommodating perspective projection; in other words, the same 3D plane may not look similar in two images.

Belief Propagation (BP) is a message-passing algorithm, where messages are functions from nodes to their neighbors, such that the message M_{t→s}(u_s) represents, in words, "node t's opinion of the [negative log of the] likelihood that node s has value u_s" [26, 27]. When implemented as an iterative algorithm, messages are updated according to a schedule (as in PatchMatch), and the messages on the right-hand side of the update equation are those of the previous iteration, or those computed earlier in the current iteration. Messages are typically initialized to all zeros, and at convergence the beliefs estimate the minimizer. Loopy Belief Propagation (LBP) is BP on a graph that contains loops, i.e. is not a tree.
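As a small sketch (mine, and on a chain, where BP is exact, rather than a loopy graph), the min-sum message update M_{t→s}(u_s) = min over u_t of [ D_t(u_t) + V(u_t, u_s) + sum of messages into t except from s ] can be run forward and backward, and the beliefs minimized per node:

import numpy as np

def chain_min_sum_bp(D, V):
    # D: (N, L) unary (data) costs; V: (L, L) pairwise cost V[u_t, u_s].
    n, L = D.shape
    fwd = np.zeros((n, L))   # fwd[i]: message from node i-1 into node i
    bwd = np.zeros((n, L))   # bwd[i]: message from node i+1 into node i
    for i in range(1, n):
        fwd[i] = np.min((D[i - 1] + fwd[i - 1])[:, None] + V, axis=0)
    for i in range(n - 2, -1, -1):
        bwd[i] = np.min((D[i + 1] + bwd[i + 1])[None, :] + V, axis=1)
    beliefs = D + fwd + bwd          # total cost per node per label
    return beliefs.argmin(axis=1)    # minimizing label at each node

With D taken from per-pixel matching costs along a scanline and a smoothness term such as V[a, b] = |a − b|, this is the classic scanline stereo regularizer.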

REINFORCE (Monte-Carlo Policy Gradient): there is a lot of good literature/media on Reinforcement Learning [28, 29, 30, 31, 32]. REINFORCE is a very popular policy gradient method, also known as Monte-Carlo Policy Gradient. The algorithm samples a full trajectory (hence Monte Carlo) and then updates the policy weights backward through the trajectory. The algorithm is as follows.

# Policy gradient methods search for a local maximum of the objective
# J(θ) by gradient ascent in parameter (θ) space: Δθ = α ∇θ J(θ).
# Likelihood-ratio trick:
#   ∇θ πθ(s,a) = πθ(s,a) * [∇θ πθ(s,a) / πθ(s,a)] = πθ(s,a) ∇θ log πθ(s,a)
#   ∇θ log πθ(s,a) is called the score function;
#   ∇θ πθ(s,a) / πθ(s,a) is the likelihood ratio.
# Policy Gradient Theorem:
#   ∇θ J(θ) = E_πθ [ ∇θ log πθ(s,a) * Q_πθ(s,a) ]
# where E_πθ is the expectation over states s and actions a. The theorem
# generalizes the likelihood-ratio approach to multi-step Markov Decision
# Processes (MDPs), replacing the instantaneous reward r with the
# long-term value Q_πθ(s,a), estimated here by the sampled return v_t.
def REINFORCE():
    initialize θ arbitrarily
    # s -> state, a -> action, r -> reward
    for each episode {s1, a1, r2, ..., s_T-1, a_T-1, r_T} sampled from πθ:
        for t in range(1, T - 1):
            # v_t is the complete return from time t; sampling whole
            # trajectories makes this a Monte-Carlo (stochastic) estimate
            θ = θ + α * ∇θ log πθ(s_t, a_t) * v_t
    return θ

References:

  1. PatchMatch-RL: Deep MVS With Pixelwise Depth, Normal, and Visibility, paper: https://openaccess.thecvf.com/content/ICCV2021/papers/Lee_PatchMatch-RL_Deep_MVS_With_Pixelwise_Depth_Normal_and_Visibility_ICCV_2021_paper.pdf, code: https://github.com/leejaeyong7/patchmatch-rl
  2. Plane-sweep https://www.uio.no/studier/emner/matnat/its/nedlagte-emner/UNIK4690/v16/forelesninger/lecture_8_3_multiple_view_stereo.pdf
  3. The PatchMatch Randomized Matching Algorithm for Image Manipulation, http://www.connellybarnes.com/work/publications/2011_patchmatch_cacm.pdf
  4. http://vis.berkeley.edu/courses/cs294-69-fa11/wiki/images/1/18/05-PatchMatch.pdf
  5. Deep Multi-View Stereo gone wild, https://arxiv.org/pdf/2104.15119.pdf
  6. Pixelwise View Selection for Unstructured Multi-View Stereo, https://demuc.de/papers/schoenberger2016mvs.pdf ; https://github.com/colmap/colmap.
  7. Multi-view stereo, https://slazebni.cs.illinois.edu/spring19/lec19_multiview_stereo.pdf
  8. Deep Learning for Multi-view Stereo via Plane Sweep: A Survey, https://arxiv.org/pdf/2106.15328.pdf
  9. MVSNet: Depth Inference for Unstructured Multi-view Stereo, https://arxiv.org/pdf/1804.02505.pdf
  10. Recurrent MVSNet for High-resolution Multi-view Stereo Depth Inference, https://arxiv.org/pdf/1902.10556.pdf
  11. MVSCRF: Learning multi-view stereo with conditional random fields, https://openaccess.thecvf.com/content_ICCV_2019/papers/Xue_MVSCRF_Learning_Multi-View_Stereo_With_Conditional_Random_Fields_ICCV_2019_paper.pdf
  12. PVSNet: Pixelwise Visibility-Aware Multi-View Stereo Network, https://arxiv.org/pdf/2007.07714.pdf
  13. PatchmatchNet: Learned multi-view patchmatch stereo, 2020, https://arxiv.org/abs/2012.01411
  14. PatchMatch Based Joint View Selection and Depthmap Estimation, https://openaccess.thecvf.com/content_cvpr_2014/papers/Zheng_PatchMatch_Based_Joint_2014_CVPR_paper.pdf
  15. Dense Hybrid Recurrent Multi-view Stereo Net with Dynamic Consistency Checking, https://arxiv.org/pdf/2007.10872.pdf
  16. Zero-normalized cross-correlation (ZNCC), https://en.wikipedia.org/wiki/Cross-correlation
  17. Guided Image Filtering, http://kaiminghe.com/eccv10/
  18. A Space-Sweep Approach to True Multi-Image Matching, https://www.ri.cmu.edu/pub_files/pub1/collins_robert_1996_1/collins_robert_1996_1.pdf
  19. Cost Volume Pyramid Based Depth Inference for Multi-View Stereo, https://arxiv.org/pdf/1912.08329.pdf
  20. End-to-End Learning of Geometry and Context for Deep Stereo Regression, https://arxiv.org/pdf/1703.04309.pdf
  21. A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms, https://vision.middlebury.edu/stereo/taxonomy-IJCV.pdf
  22. Computational Vision and Regularization Theory, T. Poggio, V. Torre, C. Koch, Nature 317, 314–319 (1985).
  23. Awesome 3D reconstruction list, https://github.com/openMVG/awesome_3DReconstruction_list
  24. Multi-view stereo: A tutorial, https://carlos-hernandez.org/papers/fnt_mvs_2015.pdf
  25. Learning Inverse Depth Regression for Multi-View Stereo with Correlation Cost Volume, https://arxiv.org/pdf/1912.11746.pdf
  26. PMBP: PatchMatch Belief Propagation for Correspondence Field Estimation, https://www.microsoft.com/en-us/research/wp-content/uploads/2012/01/PMBP.pdf
  27. Introduction to Loopy Belief Propagation, https://cseweb.ucsd.edu/classes/sp06/cse151/lectures/belief-propagation.pdf
  28. Policy Gradient Theorem Explained — Reinforcement Learning, https://www.youtube.com/watch?v=cQfOQcpYRzE
  29. 10703 Deep Reinforcement Learning and Control, http://www.cs.cmu.edu/~rsalakhu/10703/Lectures/Lecture_PG.pdf
  30. Policy Gradient Methods for Reinforcement Learning with Function Approximation, https://homes.cs.washington.edu/~todorov/courses/amath579/reading/PolicyGradient.pdf
  31. Reinforcement Learning: An Introduction, by Richard S. Sutton, ‎Andrew G. Barto, The MIT Press, 2018.
  32. Sample Efficient Reinforcement Learning with REINFORCE, https://web.stanford.edu/~boyd/papers/pdf/conv_reinforce_aaai_preprint_short.pdf
