The popularity of wearables and their seamless integration into our daily lives have turned these devices into an appealing platform for deploying automatic fall detection systems. In recent years, an extensive literature on new methods and algorithms for such wearable detectors has been produced. However, in most cases the proposed algorithms are tested against a single dataset (or, at best, two) containing signals captured from falls and conventional movements. This work evaluates the behavior of a fall detection system based on a convolutional neural network when different public repositories of movements are alternately used to train and test the model. After a systematic cross-dataset evaluation involving four well-known datasets, we show how difficult it is to extrapolate the results achieved by a given classifier on a particular dataset to a different one. The results suggest that classification methods tend to overfit to the particular conditions (typology of the movements, characteristics of the employed sensor, experimental subjects) under which the training samples were generated.
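As a minimal illustration of the cross-dataset protocol summarized above, the sketch below trains a small 1D convolutional network on one repository of acceleration windows and evaluates it on the others, contrasting intra-dataset and cross-dataset accuracy. The architecture, window size, dataset names, and synthetic loaders are hypothetical placeholders for illustration only, not the system or data evaluated in this work.

```python
# Illustrative sketch: train on one repository, test on all others.
# All names, shapes, and the CNN architecture are assumptions for the example.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

WINDOW, CHANNELS = 256, 3  # assumed fixed-size tri-axial acceleration windows


def make_cnn():
    """Small 1D CNN for binary fall / no-fall classification (placeholder)."""
    return keras.Sequential([
        layers.Input(shape=(WINDOW, CHANNELS)),
        layers.Conv1D(32, 9, activation="relu"),
        layers.MaxPooling1D(4),
        layers.Conv1D(64, 9, activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(1, activation="sigmoid"),
    ])


def synthetic_dataset(n=400, seed=0):
    """Stand-in for loading one public fall repository (random data here)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, WINDOW, CHANNELS)).astype("float32")
    y = rng.integers(0, 2, size=n).astype("float32")
    return X, y


# One entry per repository; real work would load the actual public datasets.
datasets = {name: synthetic_dataset(seed=i)
            for i, name in enumerate(["A", "B", "C", "D"])}

for train_name, (X_tr, y_tr) in datasets.items():
    model = make_cnn()
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    model.fit(X_tr, y_tr, epochs=5, batch_size=32, verbose=0)
    for test_name, (X_te, y_te) in datasets.items():
        _, acc = model.evaluate(X_te, y_te, verbose=0)
        tag = "intra" if train_name == test_name else "cross"
        print(f"train={train_name} test={test_name} ({tag}): acc={acc:.3f}")
```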