Large-scale labelled data are required to train deep neural networks to obtain good performance in visual learning from images or videos. However, collecting such large-scale datasets is expensive. Given the abundance of free, unlabelled data on the web, self-supervised learning has attracted increasing attention. In the domain of spatio-temporal representation learning, leveraging temporal information in a self-supervised manner is not yet fully explored. In this paper, we propose a novel method, referred to as Cycle Encoding Prediction (CEP), that learns spatio-temporal representations from unlabelled videos. CEP leverages the temporal coherence of the current, past, and future states of the video stream as the self-supervision signal. By combining cycle prediction consistency with temporal contrastive learning, CEP effectively learns spatio-temporal representations through bi-directional prediction. CEP consists of a feature encoder and two prediction modules, one for future prediction and one for past prediction. We evaluate CEP on the UCF101 dataset, where it achieves state-of-the-art results for self-supervised learning on the action recognition task.
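To make the bi-directional prediction idea concrete, the following is a minimal PyTorch sketch of a CEP-style training objective. It is an illustration under stated assumptions, not the authors' exact implementation: the linear prediction heads, the MSE form of the cycle consistency term, the InfoNCE temperature, and all names (`CycleEncodingPrediction`, `predict_future`, `predict_past`) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CycleEncodingPrediction(nn.Module):
    """Hypothetical sketch: a clip encoder plus forward/backward prediction
    modules, trained with a cycle consistency loss and a temporal
    contrastive (InfoNCE) loss."""
    def __init__(self, encoder: nn.Module, dim: int = 512, tau: float = 0.07):
        super().__init__()
        self.encoder = encoder                     # e.g. a 3D-CNN clip encoder
        self.predict_future = nn.Linear(dim, dim)  # future prediction module
        self.predict_past = nn.Linear(dim, dim)    # past prediction module
        self.tau = tau                             # contrastive temperature (assumed)

    def forward(self, past_clip, current_clip, future_clip):
        z_past = self.encoder(past_clip)           # (B, dim) clip embeddings
        z_cur = self.encoder(current_clip)
        z_fut = self.encoder(future_clip)

        # Bi-directional prediction from the current state.
        z_fut_hat = self.predict_future(z_cur)
        z_past_hat = self.predict_past(z_cur)

        # Cycle consistency: predicting forward and then backward should
        # return (approximately) to the current state.
        cycle_loss = F.mse_loss(self.predict_past(z_fut_hat), z_cur)

        # Temporal contrastive loss: each predicted state should match the
        # true state of its own clip; other clips in the batch are negatives.
        def info_nce(pred, target):
            logits = F.normalize(pred, dim=1) @ F.normalize(target, dim=1).T
            labels = torch.arange(logits.size(0), device=logits.device)
            return F.cross_entropy(logits / self.tau, labels)

        contrastive_loss = info_nce(z_fut_hat, z_fut) + info_nce(z_past_hat, z_past)
        return cycle_loss + contrastive_loss

# Usage with a stand-in encoder and random 8-frame RGB clips.
model = CycleEncodingPrediction(nn.Sequential(nn.Flatten(), nn.LazyLinear(512)))
past, cur, fut = (torch.randn(4, 3, 8, 64, 64) for _ in range(3))
loss = model(past, cur, fut)
```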
We propose the first multi-frame video object detection framework trained to detect great apes. It is applicable to challenging camera trap footage in complex jungle environments and extends a traditional feature pyramid architecture by adding self-attention driven feature blending in both the spatial and temporal domains. We demonstrate that this extension can detect distinctive species appearance and motion signatures despite significant partial occlusion. We evaluate the framework on 500 camera trap videos of great apes from the Pan African Programme, containing 180K frames, which we manually annotated with accurate per-frame animal bounding boxes. These clips contain significant partial occlusions, challenging lighting, dynamic backgrounds, and natural camouflage effects. We show that our approach is highly robust and significantly outperforms frame-based detectors.
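As an illustration of the spatio-temporal feature-blending idea, below is a minimal PyTorch sketch that applies self-attention jointly over the spatial and temporal positions of one feature pyramid level. The module name (`SpatioTemporalBlender`), channel width, head count, and clip length are assumptions for the sketch; the authors' full detector integrates such blending into a complete feature pyramid architecture.

```python
import torch
import torch.nn as nn

class SpatioTemporalBlender(nn.Module):
    """Hypothetical module: blends features of T consecutive frames at one
    pyramid level by letting every spatio-temporal position attend to all
    others, so a partially occluded animal in one frame can borrow
    evidence from neighbouring frames."""
    def __init__(self, channels: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, C, H, W) -- one pyramid level for T frames.
        b, t, c, h, w = feats.shape
        # Flatten all spatio-temporal positions into one token sequence.
        tokens = feats.flatten(3).permute(0, 1, 3, 2).reshape(b, t * h * w, c)
        blended, _ = self.attn(tokens, tokens, tokens)
        # Restore the original (B, T, C, H, W) layout.
        return blended.reshape(b, t, h, w, c).permute(0, 1, 4, 2, 3)

# Usage: blend a 5-frame clip at one (downsampled) pyramid level, then pass
# the key frame's blended features to the detection head as usual.
feats = torch.randn(2, 5, 256, 16, 16)
print(SpatioTemporalBlender()(feats).shape)  # torch.Size([2, 5, 256, 16, 16])
```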