Detecting animals in a cluttered scene is not a trivial task. So far, convolutional neural network (CNN) architectures have been the standard approach. We introduce stacked convolutional autoencoders (SCAE) for this task. An SCAE is an unsupervised hierarchical feature extractor that can be applied to high-dimensional input images. We also introduce a hybrid feature extraction technique based on Fisher Vectors (FV) and stacked autoencoders (SAE). The SCAE learns salient features using plain stochastic gradient descent and finds a good initialization for CNNs, which helps avoid the numerous distinct local minima of the highly non-convex objective functions that arise in virtually all deep learning problems. We propose a parallel pipeline for detecting animals in both visible and infrared images. The proposed framework achieves 97% accuracy. © 2020, Springer Nature Singapore Pte Ltd.
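
The idea of stacking autoencoders trained with plain stochastic gradient descent can be illustrated with a minimal sketch. It is not the authors' implementation: it uses small dense layers instead of convolutional ones, omits biases, and runs on hand-made toy vectors. The names `train_autoencoder` and `encode`, the layer sizes, and the learning rate are all assumptions chosen for illustration. Each layer is trained without labels to reconstruct its own input, and the second layer is trained on the codes produced by the first (greedy layer-wise stacking).

```python
import math
import random


def train_autoencoder(data, hidden, epochs=200, lr=0.05, seed=0):
    """Train one tanh-encoder / linear-decoder layer with per-sample SGD.

    Hypothetical helper for illustration. Returns (W_enc, W_dec, losses),
    where losses[i] is the mean squared reconstruction error in epoch i.
    """
    rng = random.Random(seed)
    n_in = len(data[0])
    W_enc = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    W_dec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(n_in)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x in data:
            # Forward pass: encode to a hidden code, then decode back.
            h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W_enc]
            y = [sum(w * hj for w, hj in zip(row, h)) for row in W_dec]
            err = [yi - xi for yi, xi in zip(y, x)]
            total += sum(e * e for e in err) / n_in
            # Backward pass for the loss 0.5 * sum(err^2); the hidden
            # gradient is computed before the decoder weights change.
            grad_h = [
                sum(err[i] * W_dec[i][j] for i in range(n_in)) * (1.0 - h[j] ** 2)
                for j in range(hidden)
            ]
            # Plain SGD update: no momentum, no adaptive step size.
            for i in range(n_in):
                for j in range(hidden):
                    W_dec[i][j] -= lr * err[i] * h[j]
            for j in range(hidden):
                for k in range(n_in):
                    W_enc[j][k] -= lr * grad_h[j] * x[k]
        losses.append(total / len(data))
    return W_enc, W_dec, losses


def encode(W_enc, data):
    """Map each input vector to its hidden code using a trained encoder."""
    return [
        [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W_enc]
        for x in data
    ]


# Toy stand-in for image features (assumed data, not from the paper).
data = [
    [0.5, -0.5, 0.5, -0.5],
    [-0.5, 0.5, -0.5, 0.5],
    [0.5, 0.5, 0.5, 0.5],
    [-0.5, -0.5, -0.5, -0.5],
]

# Layer 1 learns codes for the raw inputs (4 -> 3).
W1, _, losses1 = train_autoencoder(data, hidden=3, seed=0)
codes = encode(W1, data)

# Layer 2 is stacked on top and learns codes for layer 1's codes (3 -> 2).
W2, _, losses2 = train_autoencoder(codes, hidden=2, seed=1)

print("layer 1 loss: first epoch", losses1[0], "last epoch", losses1[-1])
print("layer 2 loss: first epoch", losses2[0], "last epoch", losses2[-1])
```

In a pretraining setting of the kind the abstract describes, encoder weights such as `W1` and `W2` would be used to initialize the corresponding layers of a supervised network before fine-tuning; that step is not shown here.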