Object detection in video is an essential computer vision task, and many efforts have consequently been devoted to developing accurate and fast deep-learning models for it. These models are commonly deployed on discrete, powerful GPU devices to meet both frame-rate and detection-accuracy requirements. Moreover, model training is usually performed in a strongly supervised way, so samples must first be labelled by humans in a slow and costly process. In this paper, we develop a real-time implementation of unsupervised object detection in video on a low-power device. We improve typical object detection approaches by using information supplied by optical flow to detect moving objects. In addition, we use an unsupervised clustering algorithm to group similar detections, which avoids manual object labelling. Finally, we propose a methodology to optimize the deployment of the resulting framework on an embedded heterogeneous platform, illustrating how all the computational resources of a Jetson AGX Xavier (CPU, GPU, and DLAs) can be exploited to fulfil frame-rate, accuracy, and energy-consumption requirements. Three data representations (FP32, FP16, and INT8) are studied for the pipeline networks in order to evaluate the impact of each on the complete pipeline. The results show that our proposed optimizations reduce energy consumption by up to 23.6x and execution time by up to 32.2x with respect to the non-optimized pipeline, without penalizing the original mAP (59.44). This reduction in computational cost is achieved through knowledge distillation, FP16 data precision, and the concurrent deployment of tasks on different computing units.
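To make the flow-guided detection step concrete, the sketch below shows one simple way to turn dense optical flow into moving-object proposals using OpenCV's Farneback estimator. It is a minimal illustration, not the paper's actual detector: the function name moving_object_boxes, the magnitude threshold, and the minimum-area filter are assumptions introduced here for clarity, and the subsequent unsupervised clustering of the resulting detections is omitted.

```python
import cv2
import numpy as np

def moving_object_boxes(prev_gray, gray, mag_thresh=2.0, min_area=500):
    """Derive moving-object bounding boxes from dense optical flow.

    Illustrative stand-in for the flow-guided detection described in the
    abstract; thresholds are placeholders, not the paper's values.
    """
    # Dense flow between two consecutive greyscale frames
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])

    # Keep pixels whose motion magnitude exceeds the threshold
    mask = (mag > mag_thresh).astype(np.uint8) * 255
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, np.ones((5, 5), np.uint8))

    # Each sufficiently large connected region becomes a detection
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours
            if cv2.contourArea(c) >= min_area]
```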
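Likewise, the FP16 deployment on the Xavier's accelerators can be sketched with TensorRT's Python builder API. The snippet below builds a serialized FP16 engine that prefers a DLA core and falls back to the GPU for unsupported layers; the ONNX file path and the helper name build_fp16_dla_engine are illustrative placeholders, not the paper's actual configuration.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_fp16_dla_engine(onnx_path, dla_core=0):
    """Build a serialized TensorRT engine in FP16 that prefers a DLA core."""
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)            # FP16 data representation
    config.default_device_type = trt.DeviceType.DLA  # prefer a DLA core
    config.DLA_core = dla_core
    config.set_flag(trt.BuilderFlag.GPU_FALLBACK)    # GPU runs unsupported layers

    return builder.build_serialized_network(network, config)
```

Under a scheme like this, separate engines can be placed on the GPU and on each DLA core, which is what makes it possible to run pipeline stages concurrently on different computing units.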