I reviewed a few papers that were presented orally at CVPR 2021 and have tried to capture their essence below.


The first paper is Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling [1]. The authors propose an end-to-end trainable architecture for video-and-language tasks and evaluate it on two of them: Text-to-Video Retrieval and Video Question Answering (Video QA). In Text-to-Video Retrieval, the goal is to retrieve the video segment that best matches the input text. In Video QA, the goal is to answer a question about a video segment, either in free form or as multiple choice.
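To make the retrieval task concrete, here is a minimal sketch of Text-to-Video Retrieval framed as ranking. It is not ClipBERT's architecture; it only illustrates the task's input and output. It assumes the text query and each video segment have already been mapped to vectors by some model, and the names `cosine`, `retrieve`, `clips`, and `query`, as well as the toy vectors, are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(text_vec, clip_vecs, top_k=1):
    """Return the ids of the top_k video segments most similar to the text."""
    scored = sorted(
        clip_vecs.items(),
        key=lambda kv: cosine(text_vec, kv[1]),
        reverse=True,
    )
    return [clip_id for clip_id, _ in scored[:top_k]]


# Hypothetical, hand-written embeddings standing in for model outputs.
clips = {
    "clip_a": [0.9, 0.1, 0.0],
    "clip_b": [0.1, 0.8, 0.1],
    "clip_c": [0.0, 0.2, 0.9],
}
query = [0.2, 0.9, 0.0]  # assumed embedding of the input text

ranked = retrieve(query, clips, top_k=2)
print(ranked)
```

Retrieval quality then comes down to how well the model scores matching text and video pairs above non-matching ones, which is what the paper's evaluation on this task measures.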


