Visual object recognition is a fundamental challenge for reliable search and rescue (SAR) robots, whose vision can be limited by lighting and other harsh environmental conditions at disaster sites. The goal of this paper is to explore the use of thermal and visible light images for automatic object detection in SAR scenes. To this end, we have used a new dataset consisting of pairs of thermal infrared (TIR) and visible (RGB) video sequences captured from an all-terrain vehicle moving through several realistic SAR exercises in which actual first response organisations took part. Two instances of the open source YOLOv3 convolutional neural network (CNN) architecture are trained on annotated sets of RGB and TIR images, respectively. In particular, frames are labelled with four classes representative of SAR scenes, comprising both persons (civilian and first-responder) and vehicles (civilian-car and response-vehicle). Furthermore, we perform a comparative evaluation of these networks that can provide insight for future RGB/TIR fusion.