CVPR 2021 Select Paper Reviews

Reviewed a few papers that were presented orally at CVPR 2021. Tried to capture their essence in the following.


The first paper is Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling [1]. The authors propose an end-to-end trainable architecture for video-and-language tasks and evaluate it on two of them: Text-to-Video Retrieval and Video Question Answering (Video QA). In Text-to-Video Retrieval, the goal is to retrieve the video segment that best matches the input text. In Video QA, the goal is to answer a question, free-form or multiple-choice, about a video segment.

The authors claim that their architecture, ClipBERT, needs less memory and training time while achieving superior accuracy. They attribute this to their choices of input and architecture. The memory and time advantage comes from using only a few sparsely sampled frames. Frame-level processing also lends itself to 2D convolutions rather than the 3D convolutions used for videos, which further reduces resource needs. The accuracy improvement likely comes from the network being end-to-end trainable.

The following figure shows a conceptual block diagram of the network. There is one branch for text and another for video. In contrast to previous works, the authors allow gradients to propagate beyond the text and clip features into the embedding networks. This allows fine-tuning of the embedding networks and potentially better accuracy. The authors used ResNet50 for video embedding and BERT-base for text embedding. From each sampled clip, T frames are sampled uniformly. If T > 1, a temporal fusion layer (e.g., mean-pooling) aggregates the frame feature maps into a single feature map for the clip. Clips are 1 second long and are extracted from the entire video.
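The sparse sampling and temporal fusion described above can be sketched in a few lines of plain Python. This is only an illustration, not the authors' code: the function names are hypothetical, clip start times are assumed to be drawn at random, and each frame feature map is flattened to a short list of numbers.

```python
import random


def sample_clip_starts(video_seconds, num_clips, clip_seconds=1.0, seed=0):
    """Pick start times (in seconds) for a few short clips (random starts are an assumption)."""
    rng = random.Random(seed)
    latest_start = max(video_seconds - clip_seconds, 0.0)
    return sorted(rng.uniform(0.0, latest_start) for _ in range(num_clips))


def uniform_frame_times(clip_start, num_frames, clip_seconds=1.0):
    """Spread T frame timestamps evenly across one clip."""
    step = clip_seconds / num_frames
    return [clip_start + (i + 0.5) * step for i in range(num_frames)]


def mean_pool_frames(frame_features):
    """Temporal fusion: average T per-frame feature vectors into one clip feature."""
    num_frames = len(frame_features)
    return [sum(values) / num_frames for values in zip(*frame_features)]


starts = sample_clip_starts(video_seconds=30.0, num_clips=4)
times = uniform_frame_times(starts[0], num_frames=2)
fused = mean_pool_frames([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])
print(fused)  # [2.0, 3.0, 4.0]
```

Only 4 clips of 2 frames each are touched here, regardless of how long the video is, which is where the memory and time savings come from.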

The following code snippets from the authors' GitHub repo show some of the data flow in the model during pre-training. The method setup_model sets up the entire model by instantiating a class called ClipBert. ClipBert uses a detectron2 ResNet50 and ClipBertForPreTraining. ClipBertForPreTraining in turn uses ClipBertBaseModel and BertPreTrainingHeads.

# Abridged excerpts; "..." and "# ..." mark code elided from the original source.

def setup_model(cfg, device=None):
    ...
    model = ClipBert(
        model_cfg, input_format=cfg.img_input_format,
        # ... remaining arguments elided
    )
    ...


class ClipBert(nn.Module):
    def __init__(self, config, input_format="BGR",
                 # ... remaining parameters elided
                 ):
        ...

    def forward(self, batch):
        # used to make visual feature copies
        repeat_counts = batch["n_examples_list"]
        del batch["n_examples_list"]
        visual_features = self.cnn(batch["visual_inputs"])
        batch["visual_inputs"] = repeat_tensor_rows(
            visual_features, repeat_counts)
        if self.retrieval:
            batch["sample_size"] = len(repeat_counts)  # batch size
        outputs = self.transformer(**batch)  # dict
        return outputs


class ClipBertForPreTraining(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.config = config
        self.bert = ClipBertBaseModel(config)
        self.cls = BertPreTrainingHeads(config)
        self.init_weights()

    def forward(self,
                # ... parameters elided
                ):
        ...
        outputs = self.bert(
            # ... other arguments elided
            attention_mask=text_input_mask,  # (B, Lt) note this mask is text only!!!
        )

        sequence_output, pooled_output = outputs[:2]
        # Only use the text part (which is the first `Lt` tokens) to save computation,
        # this won't cause any issue as cls only has linear layers.
        txt_len = text_input_mask.shape[1]
        prediction_scores, seq_relationship_score = self.cls(
            sequence_output[:, :txt_len], pooled_output)
        ...  # compute and return the losses

The following code snippet shows the data flow inside ClipBertBaseModel. ClipBertBaseModel uses the classes BertEmbeddings and VisualInputEmbedding for text and video frame embedding. Both of these classes rely on PyTorch's nn.Embedding to do the final embedding.

# Abridged excerpts; "# ..." marks code elided from the original source.

class ClipBertBaseModel(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.config = config
        self.embeddings = BertEmbeddings(config)
        self.visual_embeddings = VisualInputEmbedding(config)
        self.encoder = BertEncoder(config)
        self.pooler = BertPooler(config)
        self.init_weights()

    def forward(self, text_input_ids, visual_inputs, attention_mask):
        r"""Modified from BertModel
        text_input_ids: (B, Lt)
        visual_inputs: (B, #frame, H, W, C)
        attention_mask: (B, Lt) with 1 indicates valid, 0 indicates invalid position.
        """
        input_shape = text_input_ids.size()
        device = text_input_ids.device

        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
        # ourselves in which case we just need to make it broadcastable to all heads.
        text_embedding_output = self.embeddings(
            input_ids=text_input_ids)  # (B, Lt, D)
        visual_embedding_output = self.visual_embeddings(
            visual_inputs)  # (B, Lv, d)
        visual_attention_mask = attention_mask.new_ones(
            visual_embedding_output.shape[:2])  # (B, Lv)
        attention_mask = torch.cat(
            [attention_mask, visual_attention_mask], dim=-1)  # (B, Lt+Lv)
        embedding_output = torch.cat(
            [text_embedding_output, visual_embedding_output],
            dim=1)  # (B, Lt+Lv, d)

        extended_attention_mask: torch.Tensor = \
            self.get_extended_attention_mask(
                attention_mask, input_shape, device)

        encoder_outputs = self.encoder(
            embedding_output,
            attention_mask=extended_attention_mask,
            head_mask=self.get_head_mask(
                None, self.config.num_hidden_layers)  # required input
        )
        sequence_output = encoder_outputs[0]
        pooled_output = self.pooler(sequence_output)
        outputs = (sequence_output, pooled_output,) + encoder_outputs[1:]
        return outputs  # sequence_output, pooled_output, (hidden_states), (attentions)


class BertEmbeddings(nn.Module):
    """Construct the embeddings from word, position and token_type embeddings."""

    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(
            config.vocab_size, config.hidden_size,
            # ... remaining arguments elided
        )
        self.position_embeddings = nn.Embedding(
            config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(
            config.type_vocab_size, config.hidden_size)
        # self.LayerNorm is not snake-cased to stick with
        # TensorFlow model variable name and be able to load
        # any TensorFlow checkpoint file
        self.LayerNorm = BertLayerNorm(
            config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)


class VisualInputEmbedding(nn.Module):
    """Takes input of both image and video (multi-frame)"""

    def __init__(self, config):
        super(VisualInputEmbedding, self).__init__()
        self.config = config
        # sequence embedding
        self.position_embeddings = nn.Embedding(
            config.max_position_embeddings, config.hidden_size)
        self.row_position_embeddings = nn.Embedding(
            # ... arguments elided
        )
        self.col_position_embeddings = nn.Embedding(
            # ... arguments elided
        )
        self.token_type_embeddings = nn.Embedding(1, config.hidden_size)
        self.LayerNorm = BertLayerNorm(
            config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

The following code snippets show the data flow inside BertPreTrainingHeads. BertPreTrainingHeads uses BertLMPredictionHead and nn.Linear. BertLMPredictionHead uses BertPredictionHeadTransform, nn.Linear and nn.Parameter. BertPredictionHeadTransform in turn uses nn.Linear, an activation function such as gelu, relu, swish or mish, and FusedLayerNorm. FusedLayerNorm comes from the apex package, which contains NVIDIA-maintained utilities to streamline mixed precision and distributed training.

from apex.normalization.fused_layer_norm import FusedLayerNorm as LayerNorm

BertLayerNorm = LayerNorm

ACT2FN = {"gelu": gelu, "relu": torch.nn.functional.relu, "swish": swish,
          "gelu_new": gelu_new, "mish": mish}


class BertPreTrainingHeads(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.predictions = BertLMPredictionHead(config)
        self.seq_relationship = nn.Linear(config.hidden_size, 2)

    def forward(self, sequence_output, pooled_output):
        prediction_scores = self.predictions(sequence_output)
        seq_relationship_score = self.seq_relationship(pooled_output)
        return prediction_scores, seq_relationship_score


class BertLMPredictionHead(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.transform = BertPredictionHeadTransform(config)
        # The output weights are the same as the input embeddings, but there is
        # an output-only bias for each token.
        self.decoder = nn.Linear(
            config.hidden_size, config.vocab_size, bias=False)
        self.bias = nn.Parameter(torch.zeros(config.vocab_size))
        # Need a link between the two variables so that the bias is
        # correctly resized with `resize_token_embeddings`
        self.decoder.bias = self.bias

    def forward(self, hidden_states):
        hidden_states = self.transform(hidden_states)
        hidden_states = self.decoder(hidden_states)
        return hidden_states


class BertPredictionHeadTransform(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        if isinstance(config.hidden_act, str):
            self.transform_act_fn = ACT2FN[config.hidden_act]
        else:
            self.transform_act_fn = config.hidden_act
        self.LayerNorm = BertLayerNorm(
            config.hidden_size, eps=config.layer_norm_eps)

    def forward(self, hidden_states):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.transform_act_fn(hidden_states)
        hidden_states = self.LayerNorm(hidden_states)
        return hidden_states

The authors do cross-modal pre-training on the COCO Captions and Visual Genome Captions data sets, then fine-tune the pre-trained model on the specific video-text task. They pre-train on image-text data sets, rather than on the fine-tuning data, because there is not enough data for video-text tasks.

The authors experimented with various numbers of clips and frames per clip, and measured the batch size (a proxy for memory usage) and the pass time (a proxy for compute). As expected, pass time grows almost linearly with both the number of clips and the number of frames per clip, while the batch size shrinks almost in inverse proportion to both.

The authors found that larger input resolution improves performance on the video retrieval task, while maintaining a similar performance on the video QA task.

The authors found that adding more frames during inference first improves accuracy but eventually saturates.

The authors found that ClipBERT performs better than other state-of-the-art networks such as HERO and FSE on text-to-video retrieval tasks. They also observed better performance than QueST, HCRN and others on video QA tasks. However, the authors did not estimate how many resources they used versus the other networks for comparable performance. Given that they use sparse clips and frames, the resource needs are expected to be lower.

Learning to Track Instances without Video Annotations

The second paper [2], titled as above, is on multiple instance tracking in videos. The authors' architecture and methods use only labeled images instead of labeled videos. The authors build on an instance segmentation network called Segmenting Objects by Locations (SOLO) [3]. Instance segmentation requires correctly separating all objects in an image while also semantically segmenting each pixel.

The authors of SOLO [3] broadly classify instance segmentation networks into two types: top-down and bottom-up. In the top-down approach, objects are first detected with a well-established detector (e.g. Mask R-CNN) and each bounding box is then segmented. In the bottom-up approach, each pixel is assigned an embedding vector, and pixels that are close in the embedding space are considered to belong to the same instance.

The following figure (from [3]) shows the high-level block diagram of SOLO. SOLO first passes the image through a convolutional network such as FPN. It then divides the image into an S×S grid and passes the features to two branches, called the Category Branch and the Mask Branch, which predict the semantic category and the instance mask for each grid cell. The outputs of the branches are S×S×C for the category (assuming there are C categories) and H×W×S² for the instance pixel masks. The head networks are 4 to 7 layers deep and consist of convolution or interpolation layers.
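To make the link between the two outputs concrete, here is a minimal sketch of how an object center maps to a grid cell and how that cell maps to one of the S² mask channels. The function names are hypothetical; the channel index k = i·S + j follows the convention of the SOLO paper, with zero-based row i and column j.

```python
def grid_cell(center_x, center_y, width, height, grid_size):
    """Return (row, col) of the S x S grid cell that contains an object center."""
    col = min(int(center_x / width * grid_size), grid_size - 1)
    row = min(int(center_y / height * grid_size), grid_size - 1)
    return row, col


def mask_channel(row, col, grid_size):
    """Index of the mask-branch channel paired with grid cell (row, col)."""
    return row * grid_size + col


S = 5
row, col = grid_cell(center_x=300, center_y=100, width=640, height=480, grid_size=S)
print(row, col, mask_channel(row, col, S))  # 1 2 7
```

The Category Branch entry at (row, col) then holds the class scores for the object, and channel 7 of the Mask Branch output holds its pixel mask.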

The authors added two embedding heads for tracking in parallel with SOLO's two branches: one for image embedding and the other for video embedding, as shown in the following figure. The Image Embedding Head is similar to SOLO's Category Branch. In the Image Embedding branch, the authors introduced an Instance Contrastive loss along with Maximum Entropy regularization to learn, from labeled images only, an embedding feature that is capable of tracking.

In contrastive learning, the positive sample is generated by data augmentation and the negative samples are random samples from the mini-batch. The authors found that applying the commonly used InfoNCE loss [12] to each instance in an image did not perform well, because the distribution of instances with respect to their number of pixels is highly long-tailed: smaller instances are insufficiently trained because they have fewer positive samples. Hence the authors devised a loss function called Center-Contra Loss.
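For readers unfamiliar with InfoNCE, the following is a minimal plain-Python sketch of the generic loss for one query with one positive and several negatives. It is not the paper's implementation: dot-product similarity and the temperature value are assumptions made for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce(query, positive, negatives, temperature=0.1):
    """-log( exp(q.p/t) / (exp(q.p/t) + sum_k exp(q.k/t)) )."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, k) / temperature for k in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


query = [1.0, 0.0]
good = info_nce(query, positive=[1.0, 0.0], negatives=[[0.0, 1.0], [-1.0, 0.0]])
bad = info_nce(query, positive=[-1.0, 0.0], negatives=[[0.0, 1.0], [1.0, 0.0]])
print(good < bad)  # True
```

The loss is small when the query is closer to its positive than to any negative. An instance with only a handful of grid cells contributes only a handful of such (query, positive) pairs, which is the imbalance the authors point out.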

In the following couple of paragraphs, I digress into NCE, InfoNCE, and Mutual Information. Noise-contrastive Estimation (NCE) [13] estimates the pdf of an unknown distribution, $p_m$. It treats the negative log of the normalizing constant, $c$, as an additional parameter to be estimated, i.e. $\ln p_m(\cdot;\theta) = \ln p_m^0(\cdot;\alpha) + c$. The estimation is done by mixing samples from a noise distribution with the data samples to construct an artificial logistic regression problem. In the regression, the noise samples have label 0 and the data samples with unknown pdf have label 1. Mutual Information [see 14, 15 for a comprehensive overview] is commonly quantified by the following equation, where p(x,c) is the joint probability, p(x|c) is the conditional probability and p(x) is the marginal probability.

Mutual Information between a variable x and context c.
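In the symbols used in the text, the standard definition of the mutual information between x and c can be written as follows (shown for the discrete case; for continuous variables the sum becomes an integral):

```latex
I(x; c) = \sum_{x,\, c} p(x, c) \,\log \frac{p(x \mid c)}{p(x)}
```

The ratio $p(x \mid c)/p(x)$ inside the logarithm is the quantity that measures how much knowing c changes the probability of x; it equals 1 everywhere, and the mutual information is 0, when x and c are independent.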

Mutual Information can be viewed as a sort of generalized correlation metric between random variables. InfoNCE [12] uses the Mutual Information ratio from the above equation. It first projects “the target x (future) and context c (present) into a compact distributed vector representations in a way that maximally preserves the mutual information of the original signals x and c”.

Coming back to the paper under review, the following figure lists the salient equations of the loss terms used in the Image Embedding head. f_q represents the feature vector in the q-th grid cell of SOLO. Equation 1 in the figure is the InfoNCE contrastive loss, where f_q and f_p are features of the same instance i, and f_k is a feature of another instance. Ω_i is the set of all grid cells for instance i. C_i is the center representation of instance i, obtained by averaging the features of all cells of that instance. S(i,j) is the similarity matrix. The contrast term is a cross-entropy loss that pushes the center representations of the instances further apart; it is implemented by encouraging the diagonal elements of the matrix, S(i,i), to be larger than the off-diagonal elements S(i,j), ∀j ≠ i. The eventual Instance Contrastive (IC) loss is defined in Equation 5 of the figure, where K is the number of instances and I is the identity matrix.

Figure: Instance Contrastive loss equations
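The center and contrast terms can be sketched in plain Python as below. This is a sketch under stated assumptions rather than the paper's exact formulation: similarity is taken to be a dot product between centers, each row of S is normalized with a softmax, and the function names are hypothetical.

```python
import math


def instance_centers(cell_features, cell_instance_ids):
    """Average the grid-cell features that belong to each instance."""
    groups = {}
    for feature, instance_id in zip(cell_features, cell_instance_ids):
        groups.setdefault(instance_id, []).append(feature)
    return {
        instance_id: [sum(vals) / len(feats) for vals in zip(*feats)]
        for instance_id, feats in groups.items()
    }


def center_contrast_loss(centers):
    """Row-wise cross-entropy between softmax(S) and the identity matrix."""
    ids = sorted(centers)
    total = 0.0
    for row, i in enumerate(ids):
        sims = [sum(a * b for a, b in zip(centers[i], centers[j])) for j in ids]
        peak = max(sims)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in sims))
        total += log_norm - sims[row]  # -log softmax(S)[i, i]
    return total / len(ids)


centers = instance_centers(
    cell_features=[[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]],
    cell_instance_ids=[0, 0, 1],
)
print(center_contrast_loss(centers))
```

Because every instance contributes exactly one center, a small instance with few grid cells weighs as much in this term as a large one.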

The authors added Maximum Entropy Regularization to account for newly appearing objects: the similarity of a new object to the already known objects should be close to uniform, i.e. the new object should not look similar to any one existing object. This is achieved by increasing the entropy of the similarity between the center embedding of each instance and all other instances. The entropy, H, is defined by the following equation.
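As a rough illustration of the quantity being maximized, the sketch below computes the entropy of the softmax-normalized similarities between one center and the other centers. The dot-product similarity, the softmax normalization and the function names are assumptions for illustration; the paper's exact definition of H may differ.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_entropy(center, other_centers):
    """Entropy of the softmax over similarities between one center and the others."""
    sims = [sum(a * b for a, b in zip(center, other)) for other in other_centers]
    probs = softmax(sims)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uniform = similarity_entropy([1.0, 0.0], [[0.0, 1.0], [0.0, -1.0]])
peaked = similarity_entropy([1.0, 0.0], [[10.0, 0.0], [-10.0, 0.0]])
print(uniform > peaked)  # True
```

A regularizer of this kind would add the negative entropy to the loss, so that training pushes each center to be equally dissimilar to all the others.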

To further increase performance, the authors leveraged space-time correspondence [4] on unlabeled videos. The idea of correspondence is to step through the frames (or features) forward and then through the same frames backward, like a palindrome. Because of motion, a pixel in one frame corresponds to a pixel at a different location in another frame. As the starting and ending frames of the palindrome are the same, each pixel should end up where it started. The transition probability of a pixel (or feature) from one frame to the next is given by the affinity matrix, whose element $A_t^{t+1}(i, j)$ denotes the probability of instance i at time t becoming instance j at time t+1. The following equations show the application of the affinity matrices over the full forward and backward traversal and the cross-entropy based loss function defined for optimization.

Loss equation for video correspondence
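The palindrome idea can be sketched with small transition matrices in plain Python. This is an illustrative toy, not the authors' code: the affinity matrices are given directly instead of being computed from features, and the helper names are hypothetical.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def row_normalize(matrix):
    return [[v / sum(row) for v in row] for row in matrix]


def cycle_loss(forward_steps, backward_steps):
    """Cross-entropy between the round-trip transition matrix and the identity."""
    round_trip = None
    for step in list(forward_steps) + list(backward_steps):
        round_trip = step if round_trip is None else matmul(round_trip, step)
    n = len(round_trip)
    # Clamp to avoid log(0) when an instance never returns to itself.
    return -sum(math.log(max(round_trip[i][i], 1e-12)) for i in range(n)) / n


forward = [[0.9, 0.1], [0.2, 0.8]]  # frame t -> frame t+1
backward = row_normalize([list(col) for col in zip(*forward)])  # frame t+1 -> frame t
print(cycle_loss([forward], [backward]))
```

The loss is zero only when every instance returns to itself with probability 1 after the forward and backward pass, so no labels are needed to train on the video.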

The authors' method achieved comparable or slightly better performance than the state-of-the-art (SOTA) methods MaskTrack R-CNN and SipMask on the YouTube-VIS validation set. After post-processing, however, both MaskTrack R-CNN and SipMask perform better than the authors' method. Post-processing combines the initial prediction results with detection confidence, bounding box IoU, category consistency, similarity scores, etc. to increase average precision (AP).

Energy Based Scene Graph:

The third paper [5] is on scene graph generation. A scene graph is a graph representation of an image that encodes objects along with their relationships. The authors represent a scene graph as a tuple of two tensors: objects (O) and relations (R). O is n×d and R is n×n×d', where n is the number of objects, and d and d' are the total numbers of possible object and relation labels (classes).

Scene graph generation is generally a two-stage network. The first stage detects objects and their bounding boxes. The second stage takes the bounding boxes and labels and refines them with contextual information and bounding box unions to obtain the final object labels, O, and relationship labels, R. In current models, each object and relationship is considered in isolation when computing individual losses, which are then summed to obtain the loss for the given image. Such a loss formulation ignores the fact that objects and relations in a scene graph are interdependent.

The authors propose an Energy Based Method (EBM) that explicitly takes into account the structure in the output space, i.e. the object and relation labels. In addition, the authors use a graph representation of images rather than a whole-image encoding. They reason that this approach handles small objects and a variable number of objects better. The image graph uses the same adjacency matrix (between the nodes of the objects) as the scene graph. The authors also introduce the Edge Graph Neural Network (EGNN), a variant of GNN in which messages from edges, in addition to nodes, are aggregated as in the following equations.

EGNN node and edge state updates.

Papers [6, 7] go over EBMs and their uses at length. The following set of equations captures how an energy-based loss function can be calculated (see [6], section 2.2.4, equations 19–25 for more details).

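$$
\begin{aligned}
p_\theta(x, y) &= \frac{\exp(-E(\theta, x, y))}{Z(\theta)} \\
Z(\theta) &= \int \exp(-E(\theta, x', y))\, dx' \\
\log Z(\theta) &= \log \int \exp(-E(\theta, x', y))\, dx' \\
\log p_\theta(x, y) &= -\log Z(\theta) - E(\theta, x, y) \\
&= -\log \int \exp(-E(\theta, x', y))\, dx' - E(\theta, x, y) \\
\nabla_\theta \log p_\theta(x, y) &= \frac{\int \nabla_\theta E(\theta, x', y)\, \exp(-E(\theta, x', y))\, dx'}{Z(\theta)} - \nabla_\theta E(\theta, x, y) \\
&= \mathbb{E}_{x' \sim p_\theta(x', y)}\left[\nabla_\theta E(\theta, x', y)\right] - \nabla_\theta E(\theta, x, y)
\end{aligned}
$$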

The first equation above is the Gibbs distribution, which converts energy into probability space. The energy function is E(θ,x,y), with θ as the parameters, x as the data, and y as the label. The denominator Z(θ) is the normalizing or partition function. It is not always possible to compute this integral. EBMs do not consider the normalizing function, and hence can only be used for discrimination or decision tasks where only the relative values of the outputs matter. The gradient with respect to θ, ∇θ, is used for optimizing the neural networks through backpropagation. The authors approximate the expectation through Markov Chain Monte Carlo (MCMC) [10] sampling to get an approximate minimum energy. The loss is the difference between this minimum energy and the ground truth energy, as shown in the following figure.

The final loss term used has two more terms for regularization.

The following figure shows the overall architecture of the authors’ network.

The training code is divided broadly into two parts: base_model and energy_model. The base_model is a MaskRCNN (called GeneralizedRCNN) with several ROI heads following a configurable backbone (VGG, ResNet, FPN, RetinaNet, etc.) and an RPN. A number of ROI heads can be configured in build_roi_heads: box, mask, keypoint, relation, and attribute. The relation ROI head is the scene graph generator.

The method detection2graph takes the output (task_loss_dict, detections, roi_features) from the base_model and creates the image and scene graphs. The roi_features become the node states of the image graph, with the adj_matrix from the scene graph defining the edges. A similar function, gt2graph, takes the ground truth labels and generates the corresponding image and scene graphs for the ground truth. Both graphs use the implementation Graph (see [11] for a good intro to GNN, especially lecture 6). The output from detection2graph and gt2graph goes to the sampler, which is a Stochastic Gradient Langevin Dynamics optimizer implemented in the class SGLD and uses MCMC. The sampler predicts a scene graph, which goes into the energy_model along with the ground truth scene graph.

The energy_model is implemented in the GraphEnergyModel class. It takes the image graph and the scene graph as input and passes them through individual embeddings, graph neural networks and pooling layers before they are mixed in an MLP consisting of Linear, ReLU and Linear layers. The energy_model graph networks are an Edge Graph Neural Network (EGNN) for the scene graph and a Graph Neural Network (GNN) for the image graph, implemented in the classes EGNNLayer and GNNLayer. The forward method applies message passing (matrix multiplication of the adjacency matrix and node states), a kernel function (linear layer, ReLU) and a GRUCell update. The energy_model outputs are energies called positive_energy and negative_energy, for the ground truth and the prediction respectively. From these two energies, the loss is computed in loss_function and stepped through the individual optimizers base_optimizer and energy_optimizer, for base_model and energy_model respectively. Both base_optimizer and energy_optimizer are torch.optim.SGD.
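The sampler and the two energies can be illustrated with a toy example in plain Python. This is a sketch under loud assumptions, not the repository's SGLD class or GraphEnergyModel: the energy is a hand-written quadratic instead of a learned network, the sample is a 2-vector instead of a scene graph, and the names langevin_sample and temperature are hypothetical.

```python
import math
import random


def langevin_sample(energy_grad, start, steps=200, step_size=0.05,
                    temperature=0.01, seed=0):
    """Langevin dynamics: x <- x - step * dE/dx + sqrt(2 * step * T) * noise."""
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * step_size * temperature)
    x = list(start)
    for _ in range(steps):
        grad = energy_grad(x)
        x = [xi - step_size * gi + noise_scale * rng.gauss(0.0, 1.0)
             for xi, gi in zip(x, grad)]
    return x


# Toy quadratic energy with its minimum at `target` (hypothetical stand-in
# for the learned energy model).
target = [1.0, -2.0]


def energy(x):
    return sum((xi - ti) ** 2 for xi, ti in zip(x, target))


def energy_grad(x):
    return [2.0 * (xi - ti) for xi, ti in zip(x, target)]


sample = langevin_sample(energy_grad, start=[0.0, 0.0])
ground_truth = [1.2, -1.8]
positive_energy = energy(ground_truth)  # energy of the ground truth
negative_energy = energy(sample)        # energy of the sampled prediction
loss = positive_energy - negative_energy
print(loss)
```

Minimizing this difference with respect to the energy model's parameters lowers the energy of the ground truth and raises the energy of the low-energy samples the model currently prefers.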

The authors use two data sets, Visual Genome [8] and GQA [9]. Three metrics were used: Predicate Classification (PredCls), Scene Graph Classification (SGCls), and Scene Graph Detection (SGDet). In PredCls, the relationship prediction is made based on object bounding boxes and labels. In SGCls, the object and relationship labels are predicted based on bounding boxes. In SGDet, the scene graph is predicted from the image.

The authors' EBM method consistently outperforms the cross-entropy based methods, by up to 21% on Visual Genome (VG) and 27% on GQA. GQA is derived from VG, but has denser graphs with a larger number of object and relation categories. The authors experimented with four scene graph generators: VCTree, Motif, IMP and Transformers. Transformers was used only on the GQA data set, as the high memory requirement of VCTree precluded its use.


  1. Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling.
  2. Learning to Track Instances without Video Annotations.
  3. SOLO: Segmenting Objects by Locations.
  4. Space-time Correspondence as a Contrastive Random Walk. GitHub (code not available as of this writing).
  5. Energy-Based Learning for Scene Graph Generation. GitHub: mods333/energy-based-scene-graph.
  6. A Tutorial on Energy-Based Learning.
  7. Loss Functions for Discriminative Training of Energy-Based Models.
  8. Visual Genome.
  9. GQA.
  10. MCMC.
  11. CS224W: Machine Learning with Graphs.
  12. The InfoNCE (Noise Contrastive Estimation) loss in self-supervised learning; Representation Learning with Contrastive Predictive Coding.
  13. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models.
  14. A Unified Definition of Mutual Information with Applications in Machine Learning.
  15. What is Candidate Sampling.