I reviewed a few papers that were presented as orals at CVPR 2021 and have tried to capture their essence below.


The first paper is "Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling" [1]. The authors propose an end-to-end trainable architecture for video-and-language tasks. They evaluate it on two such tasks: Text-to-Video Retrieval and Video Question Answering (Video QA). In Text-to-Video Retrieval, the goal is to retrieve the video segment that best matches the input text. In Video QA, the goal is to answer a question about a video segment, in either free-form or multiple-choice format.
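
To make the retrieval task concrete, here is a minimal sketch of ranking video clips against a text query. It is not ClipBERT's architecture. It assumes the text and every clip have already been mapped to vectors in a shared space by some model, and it simply ranks the clips by cosine similarity. The function names, clip ids, and embedding values are all made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_clips(text_vec, clip_vecs):
    """Return clip ids ordered from best to worst match for the text query."""
    scores = {clip_id: cosine(text_vec, vec) for clip_id, vec in clip_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy embeddings; a real system would obtain these from a trained model.
text_query = [1.0, 0.0, 1.0]
clips = {
    "clip_a": [0.9, 0.1, 0.8],
    "clip_b": [0.0, 1.0, 0.0],
    "clip_c": [0.5, 0.5, 0.5],
}

ranking = rank_clips(text_query, clips)
print(ranking)  # best match first
```

The first element of `ranking` is the retrieved clip. Retrieval benchmarks typically report how often the correct clip appears among the top few results.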


Subrata Goswami
